How to apply Unified Memory to existing aligned host memory

I'm involved in an effort to integrate CUDA into some existing software. The software I'm integrating into is pseudo real-time, so it has a memory manager library that manually hands out pointers from a single large memory allocation made up front. CUDA's Unified Memory is attractive to us, since in theory we could change this large memory chunk to Unified Memory, keep the existing CPU code working, and add GPU kernels with very few changes to the existing data I/O stream.
Parts of our existing CPU processing code require memory to be aligned to a certain boundary. cudaMallocManaged() does not let me specify the alignment, and I feel like having to copy between "managed" and strictly-CPU buffers for those CPU sections almost defeats the purpose of UM. Is there a known way to address this issue that I'm missing?
I found this link on Stack Overflow that seems to solve it in theory, but I've been unable to produce good results with this method. Using CUDA 9.1, Tesla M40 (24GB):
#include <stdio.h>
#include <malloc.h>
#include <cuda_runtime.h>

#define USE_HOST_REGISTER 1

int main (int argc, char **argv)
{
    int num_float = 10;
    int num_bytes = num_float * sizeof(float);
    float *f_data = NULL;

    #if (USE_HOST_REGISTER > 0)
    printf("%s: Using memalign + cudaHostRegister..\n", argv[0]);
    f_data = (float *) memalign(32, num_bytes);
    cudaHostRegister((void *) f_data, num_bytes, cudaHostRegisterDefault);
    #else
    printf("%s: Using cudaMallocManaged..\n", argv[0]);
    cudaMallocManaged((void **) &f_data, num_bytes);
    #endif

    struct cudaPointerAttributes att;
    cudaPointerGetAttributes(&att, f_data);
    printf("%s: ptr is managed: %i\n", argv[0], att.isManaged);
    fflush(stdout);

    return 0;
}
When using memalign() + cudaHostRegister() (USE_HOST_REGISTER == 1), the last print statement prints 0. Device accesses via kernel launches in larger files unsurprisingly report illegal accesses.
When using cudaMallocManaged() (USE_HOST_REGISTER == 0), the last print statement prints 1 as expected.
edit: cudaHostRegister() and cudaMallocManaged() do return successful error codes for me. I left this error checking out of the sample I shared, but I did check them during my initial integration work. I just added the checking code back in, and both still return cudaSuccess.
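(For reference, the checks were along the lines of the macro below. This is only a sketch of the pattern, not the exact wrapper that was removed from the sample.)

#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
        }                                                             \
    } while (0)

/* e.g.
 *   CHECK_CUDA(cudaHostRegister((void *) f_data, num_bytes, cudaHostRegisterDefault));
 *   CHECK_CUDA(cudaMallocManaged((void **) &f_data, num_bytes));
 */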
Thanks for your insights and suggestions.

There is no method currently available in CUDA to take an existing host memory allocation and convert it into a managed memory allocation.
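If the constraint is really only alignment (rather than reusing an existing host block), one direction that may work is to flip the approach around: make the big up-front allocation with cudaMallocManaged() and let the memory manager carve aligned sub-ranges out of it. Below is a minimal sketch of that idea, assuming the alignment requirement is a power of two; ALIGNMENT is an illustrative stand-in for the real requirement, and in practice cudaMallocManaged() already returns generously aligned pointers, so the manual rounding mainly matters for sub-allocations handed out by the memory manager.

#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define ALIGNMENT 64   // illustrative alignment requirement

int main()
{
    const size_t num_bytes = 10 * sizeof(float);

    // Over-allocate so an aligned pointer can be carved out of the block.
    void *raw = NULL;
    cudaError_t err = cudaMallocManaged(&raw, num_bytes + ALIGNMENT - 1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Round up to the next ALIGNMENT boundary; the result is still managed
    // memory, so both the existing CPU code and GPU kernels can use it.
    uintptr_t p = (uintptr_t) raw;
    float *f_data = (float *) ((p + ALIGNMENT - 1) & ~(uintptr_t)(ALIGNMENT - 1));

    // ... hand f_data to the host code / kernels ...

    cudaFree(raw);   // free the original pointer, not the aligned one
    return 0;
}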

Related

vmalloc() allocates from vm_struct list

The kernel document https://www.kernel.org/doc/gorman/html/understand/understand010.html says that for vmalloc-ing:
It searches through a linear linked list of vm_structs and returns a new struct describing the allocated region.
Does that mean the vm_struct list is already created while booting up, just like kmem_cache_create, and vmalloc() just adjusts the page entries? If that is the case, say I have 16GB of RAM in an x86_64 machine, is the whole ZONE_NORMAL, i.e.
16GB - ZONE_DMA - ZONE_DMA32 - slab-memory(cache/kmalloc)
used to create the vm_struct list?
That document is fairly old. It's talking about Linux 2.5-2.6, and things have changed quite a bit with those functions from what I can tell. I'll start by talking about code from kernel 2.6.12, since that matches Gorman's explanation and is the oldest non-rc tag in the Linux kernel GitHub repo.
The vm_struct list that the document is referring to is called vmlist. It is created here as a struct pointer:
struct vm_struct *vmlist;
Trying to figure out if it is initialized with any structs during bootup took some deduction. The easiest way to figure it out was by looking at the function get_vmalloc_info() (edited for brevity):
if (!vmlist) {
    vmi->largest_chunk = VMALLOC_TOTAL;
}
else {
    vmi->largest_chunk = 0;
    prev_end = VMALLOC_START;
    for (vma = vmlist; vma; vma = vma->next) {
        unsigned long addr = (unsigned long) vma->addr;
        if (addr >= VMALLOC_END)
            break;
        vmi->used += vma->size;
        free_area_size = addr - prev_end;
        if (vmi->largest_chunk < free_area_size)
            vmi->largest_chunk = free_area_size;
        prev_end = vma->size + addr;
    }
    if (VMALLOC_END - prev_end > vmi->largest_chunk)
        vmi->largest_chunk = VMALLOC_END - prev_end;
}
The logic says that if the vmlist pointer is NULL (so !vmlist is true), then there are no vm_structs on the list and the largest_chunk of free memory in this VMALLOC area is the entire space, hence VMALLOC_TOTAL. However, if there is something on the vmlist, then the largest chunk is worked out from the difference between the address of the current vm_struct and the end of the previous one (i.e. free_area_size = addr - prev_end).
What this tells us is that when we vmalloc, we look through the vmlist to find a stretch of the virtual memory area with no vm_struct that is big enough to accommodate our request. Only then can it create this new vm_struct, which will now be part of the vmlist.
vmalloc will eventually call __get_vm_area(), which is where the action happens:
for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
    if ((unsigned long)tmp->addr < addr) {
        if ((unsigned long)tmp->addr + tmp->size >= addr)
            addr = ALIGN(tmp->size +
                         (unsigned long)tmp->addr, align);
        continue;
    }
    if ((size + addr) < addr)
        goto out;
    if (size + addr <= (unsigned long)tmp->addr)
        goto found;
    addr = ALIGN(tmp->size + (unsigned long)tmp->addr, align);
    if (addr > end - size)
        goto out;
}

found:
    area->next = *p;
    *p = area;
By this point in the function we have already created a new vm_struct named area. This for loop just needs to find where to put the struct in the list. If the vmlist is empty, we skip the loop and immediately execute the "found" lines, making *p (the vmlist) point to our struct. Otherwise, we need to find the struct that will go after ours.
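As an aside, the p = &tmp->next idiom in that loop is what makes the insertion so compact: p always points at the link field that will have to be rewritten, so the two lines after found: work whether the new area ends up at the head of vmlist or somewhere in the middle. A stand-alone, user-space sketch of the same pattern (illustration only, not kernel code):

#include <cstdio>

struct node {
    unsigned long addr;
    node *next;
};

static node *list_head;   // plays the role of vmlist

// Insert a node keeping the list sorted by addr, using the same
// pointer-to-pointer walk as __get_vm_area().
static void insert_sorted(node *area)
{
    node **p, *tmp;
    for (p = &list_head; (tmp = *p) != nullptr; p = &tmp->next)
        if (area->addr < tmp->addr)
            break;
    area->next = *p;   // same as the "found:" lines
    *p = area;
}

int main()
{
    unsigned long addrs[] = { 30, 10, 20 };
    for (unsigned long a : addrs)
        insert_sorted(new node{ a, nullptr });
    for (node *n = list_head; n; n = n->next)
        std::printf("%lu\n", n->addr);   // prints 10, 20, 30
    return 0;
}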
So in summary, this means that even though the vmlist pointer might be created at boot time, the list isn't necessarily populated at boot time. That is, unless there are vmalloc calls during boot or functions that explicitly add vm_structs to the list during boot as in future kernel versions (see below for kernel 6.0.9).
One further clarification for you. You asked if ZONE_NORMAL is used for the vmlist, but those are two separate memory address spaces. ZONE_NORMAL describes physical memory, whereas vm is virtual memory. There are lots of resources explaining the difference between the two (e.g. this Stack Overflow question). The specific virtual memory address range for vmlist goes from VMALLOC_START to VMALLOC_END. On x86_64, those were defined as:
#define VMALLOC_START 0xffffc20000000000UL
#define VMALLOC_END 0xffffe1ffffffffffUL
For kernel version 6.0.9:
The creation of the vm_struct list is here:
static struct vm_struct *vmlist __initdata;
At this point, there is nothing on the list. But in this kernel version there are a few boot functions that may add structs to the list:
void __init vm_area_add_early(struct vm_struct *vm)
void __init vm_area_register_early(struct vm_struct *vm, size_t align)
As for vmalloc in this version, the vmlist is now only a list used during initialization. get_vm_area() now calls get_vm_area_node(), which is a NUMA ready function. From there, the logic goes deeper and is much more complicated than the linear search described above.

Trap memory accesses inside a standard executable built with MinGW

So my problem is the following.
I have some platform dependent code (embedded system) which writes to some MMIO locations that are hardcoded at specific addresses.
I compile this code with some management code inside a standard executable (mainly for testing) but also for simulation (because it takes longer to find basic bugs inside the actual HW platform).
To get around the hardcoded pointers, I just redefine them to point at some variables inside a memory pool, and this works really well.
The problem is that there is specific hardware behavior on some of the MMIO locations (w1c for example), which makes "correct" testing hard or impossible.
These are the solutions i thought of:
1 - Somehow redefine the accesses to those registers and try to insert some immediate function to simulate the dynamic behavior. This is not really usable since there are various ways to write to the MMIO locations (pointers and stuff).
2 - Somehow leave the addresses hardcoded and trap the illegal access through a seg fault, find the location that triggered it, extract exactly where the access was made, handle it and return. I am not really sure how this would work (or even if it's possible).
3 - Use some sort of emulation. This will surely work, but it will void the whole purpose of running fast and native on a standard computer.
4 - Virtualization ?? Probably will take a lot of time to implement. Not really sure if the gain is justifiable.
Does anyone have any idea if this can be accomplished without going too deep? Maybe there is a way to manipulate the compiler to define a memory area for which every access will generate a callback. I'm not really an expert in x86/gcc stuff.
Edit: It seems that it's not really possible to do this in a platform-independent way, and since it will only be Windows, I will use the available API (which seems to work as expected). I found this Q here:
Is set single step trap available on win 7?
I will put the whole "simulated" register file inside a number of pages, guard them, and trigger a callback from which I will extract all the necessary info, do my stuff, then continue execution.
Thanks all for responding.
I think #2 is the best approach. I routinely use approach #4, but I use it to test code that is running in the kernel, so I need a layer below the kernel to trap and emulate the accesses. Since you have already put your code into a user-mode application, #2 should be simpler.
The answers to this question may provide help in implementing #2. How to write a signal handler to catch SIGSEGV?
What you really want to do, though, is to emulate the memory access and then have the segv handler return to the instruction after the access. This sample code works on Linux. I'm not sure if the behavior it is taking advantage of is undefined, though.
#include <stdint.h>
#include <stdio.h>
#include <signal.h>
#include <ucontext.h>   // for ucontext_t, REG_RAX, REG_RIP (needs _GNU_SOURCE when built as C)

#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)

static uint32_t read_reg(volatile uint32_t *reg_addr)
{
    uint32_t r;
    asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
    return r;
}

static void segv_handler(int, siginfo_t *, void *);

int main()
{
    struct sigaction action = { 0, };
    action.sa_sigaction = segv_handler;
    action.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &action, NULL);

    // force sigsegv
    uint32_t a = read_reg(REG_ADDR);

    printf("after segv, a = %d\n", a);
    return 0;
}

static void segv_handler(int, siginfo_t *info, void *ucontext_arg)
{
    // Make the faulting mov appear to have read 1234: set the destination
    // register (eax) and skip past the 2-byte instruction.
    ucontext_t *ucontext = static_cast<ucontext_t *>(ucontext_arg);
    ucontext->uc_mcontext.gregs[REG_RAX] = 1234;
    ucontext->uc_mcontext.gregs[REG_RIP] += 2;
}
The code to read the register is written in assembly to ensure that both the destination register and the length of the instruction are known.
This is how the Windows version of prl's answer could look:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>

#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)

static uint32_t read_reg(volatile uint32_t *reg_addr)
{
    uint32_t r;
    asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
    return r;
}

static LONG WINAPI segv_handler(EXCEPTION_POINTERS *);

int main()
{
    SetUnhandledExceptionFilter(segv_handler);

    // force sigsegv
    uint32_t a = read_reg(REG_ADDR);

    printf("after segv, a = %d\n", a);
    return 0;
}

static LONG WINAPI segv_handler(EXCEPTION_POINTERS *ep)
{
    // only handle read access violation of REG_ADDR
    if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION ||
        ep->ExceptionRecord->ExceptionInformation[0] != 0 ||
        ep->ExceptionRecord->ExceptionInformation[1] != (ULONG_PTR)REG_ADDR)
        return EXCEPTION_CONTINUE_SEARCH;

    ep->ContextRecord->Rax = 1234;
    ep->ContextRecord->Rip += 2;
    return EXCEPTION_CONTINUE_EXECUTION;
}
So, the solution (code snippet) is as follows:
First of all, I have a variable:
__attribute__ ((aligned (4096))) int g_test;
Second, inside my main function, I do the following:
AddVectoredExceptionHandler(1, VectoredHandler);
DWORD old;
VirtualProtect(&g_test, 4096, PAGE_READWRITE | PAGE_GUARD, &old);
The handler looks like this:
LONG WINAPI VectoredHandler(struct _EXCEPTION_POINTERS *ExceptionInfo)
{
    static DWORD last_addr;

    if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) {
        last_addr = ExceptionInfo->ExceptionRecord->ExceptionInformation[1];
        ExceptionInfo->ContextRecord->EFlags |= 0x100; /* Single step to trigger the next one */
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) {
        /* Re-arm the guard page that the first exception disarmed;
           PAGE_MASK is assumed to be defined as 4095 here. */
        DWORD old;
        VirtualProtect((PVOID)(last_addr & ~PAGE_MASK), 4096, PAGE_READWRITE | PAGE_GUARD, &old);
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    return EXCEPTION_CONTINUE_SEARCH;
}
This is only a basic skeleton for the functionality. Basically I guard the page on which the variable resides, and I have some linked lists in which I hold pointers to the functions and values for the addresses in question. I check that the fault-generating address is inside my list, then I trigger the callback.
On the first guard hit, the page protection will be disabled by the system, but I can call my PRE_WRITE callback, where I can save the variable's state. Because a single step is requested through EFlags, it will be followed immediately by a single-step exception (which means the variable has been written), and I can trigger a WRITE callback. All the data required for the operation is contained inside the ExceptionInformation array.
When someone tries to write to that variable:
*(int *)&g_test = 1;
A PRE_WRITE followed by a WRITE will be triggered.
When I do:
int x = *(int *)&g_test;
A READ will be issued.
In this way I can manipulate the data flow in a way that does not require modifications to the original source code.
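The linked lists mentioned above are not shown, but the bookkeeping can be as simple as a table mapping guarded addresses to callbacks, which the handler consults before dispatching a READ/PRE_WRITE/WRITE. Here is a minimal sketch of that idea; reg_monitor_t, register_monitor() and find_monitor() are made-up names for illustration, and the reg_cbk_t values are guessed from the W1C example further down.

#include <windows.h>
#include <cstddef>

typedef enum { REG_READ, REG_PRE_WRITE, REG_WRITE } reg_cbk_t;   // assumed values
typedef void (*reg_hook_t)(reg_cbk_t type);

struct reg_monitor_t {
    void      *addr;   // guarded "register" address
    size_t     size;   // size of the monitored register
    reg_hook_t hook;   // callback to invoke
};

static reg_monitor_t g_monitors[64];
static int g_monitor_count;

static void register_monitor(void *addr, size_t size, reg_hook_t hook)
{
    reg_monitor_t m = { addr, size, hook };
    g_monitors[g_monitor_count++] = m;

    DWORD old;
    VirtualProtect(addr, size, PAGE_READWRITE | PAGE_GUARD, &old);
}

// Called from the vectored handler with the faulting address taken from
// ExceptionInformation[1]; returns the matching monitor or NULL.
static reg_monitor_t *find_monitor(ULONG_PTR fault_addr)
{
    for (int i = 0; i < g_monitor_count; ++i) {
        ULONG_PTR base = (ULONG_PTR) g_monitors[i].addr;
        if (fault_addr >= base && fault_addr < base + g_monitors[i].size)
            return &g_monitors[i];
    }
    return NULL;
}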
Note: This is intended to be used as part of a test framework and any penalty hit is deemed acceptable.
For example, a W1C (write-1-to-clear) operation can be implemented like this:
void MYREG_hook(reg_cbk_t type)
{
    /** We need to save the pre-write state.
     * This is safe since we are assured to be called with
     * both PRE_WRITE and WRITE in the correct order.
     */
    static int pre;

    switch (type) {
    case REG_READ:       /* Called pre-read */
        break;
    case REG_PRE_WRITE:  /* Called pre-write */
        pre = g_test;
        break;
    case REG_WRITE:      /* Called after write */
        g_test = pre & ~g_test; /* W1C */
        break;
    default:
        break;
    }
}
This was also possible with seg-faults on illegal addresses, but I had to take one for each R/W and keep track of a "virtual register file", so the penalty hit was bigger. This way I can guard only specific areas of memory, or none, depending on the registered monitors.

Is there a way to force Windows to cache a file?

Is there a batch command or something that will force Windows to cache that file? I am trying to create a game preloader that loads certain game files into the cache before starting the game. Is there any way I can do this?
Updated main() code:
int main()
{
    // prefetch the hard-coded game file instead of taking names from the command line
    pf("C:\\Games\\World_of_Tanks\\res\\packages\\gui.pkg");
    return 0;
}
All you need to do is load the files, either using ReadFile or by memory mapping the files and touching every page (in fact, due to allocation granularity every 16th page suffices, but in theory you should be touching every page).
Memory mapping is faster and more cache-friendly, since you do not need to allocate extra memory to hold the data (which you aren't going to use for anything useful!). The OS will reuse the same physical memory for the cache and for the virtual memory that your process can see.
Several mainstream applications, including Microsoft Office and Adobe Reader do exactly that to launch faster. It's those "delayed start" services that keep your harddisk light flashing for a dozen seconds after you log in.
Do note, however, that while you can force Windows1 to cache files that way, you cannot force it to keep the files in the cache indefinitely. If there is not enough physical RAM available, the system will throw away cache contents in order to satisfy application demands.
EDIT: Minimum working example implementation using file mapping:
#include <windows.h>
#include <cstdio>

void pf(const char* name)
{
    HANDLE file = CreateFile(name, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if(file == INVALID_HANDLE_VALUE) { printf("couldn't open %s\n", name); return; }

    unsigned int len = GetFileSize(file, 0);

    HANDLE mapping = CreateFileMapping(file, 0, PAGE_READONLY, 0, 0, 0);
    if(mapping == 0) { printf("couldn't map %s\n", name); return; }

    const char* data = (const char*) MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if(data)
    {
        printf("prefetching %s... ", name);

        // need volatile or need to use result - compiler will otherwise optimize out whole loop
        volatile unsigned int touch = 0;
        for(unsigned int i = 0; i < len; i += 4096)
            touch += data[i];
    }
    else
        printf("couldn't create view of %s\n", name);

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}

int main(int argc, const char** argv)
{
    if(argc >= 2) for(int i = 1; argv[i]; ++i) pf(argv[i]);
    return 0;
}
The program will try to prefetch any filename given on the commandline.
The code isn't overly pretty but it works. It uses ANSI filenames, and leaks a file handle in case opening succeeds but mapping fails (but bleh... it's not really a problem, the OS will clean up after the program exits -- if that annoys you, wrap the handles in RAII). It's also limited to ca. 1.8GiB file size due to address space in a 32-bit build, otherwise limited to 4GiB due to GetFileSize, but that's also trivial to fix if you really need that big a file.
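(For completeness, the RAII wrapping mentioned above can be as small as the guard below; it is not part of the example, just one way to do it.)

#include <windows.h>

struct HandleGuard {
    HANDLE h;
    explicit HandleGuard(HANDLE h_) : h(h_) {}
    ~HandleGuard() { if (h && h != INVALID_HANDLE_VALUE) CloseHandle(h); }
    HandleGuard(const HandleGuard&) = delete;
    HandleGuard& operator=(const HandleGuard&) = delete;
};

// usage inside pf():
//   HandleGuard file_guard(file), mapping_guard(mapping);
//   ...no explicit CloseHandle() calls needed on any return path.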
Instead of volatile one might want to return or otherwise consume the "result", but either way works (volatile does not truly have a measurable impact on performance, compared to a disk access!).
1Truth be told, you actually can't force Windows, but it incidentally always works that way unless you explicitly request unbuffered I/O.
In theory, you could force the OS to read pages into memory and even force it to keep them in RAM by locking the memory, but your working set quota (which is very small, and you need administrative rights to modify it) will not normally let you do this. That's a good thing, though, since locking large amounts of memory is a very bad idea.
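For completeness, the locking would look roughly like the sketch below. This is not a recommendation; as said above, the default working set quota will usually make VirtualLock fail for anything sizeable, and SetProcessWorkingSetSize only helps within whatever limits the system actually grants.

#include <windows.h>
#include <cstdio>

// Try to pin a mapped view (or any readable buffer) into RAM.
bool lock_view(const char* data, size_t len)
{
    // Ask for a working set large enough to hold the view plus some slack;
    // the system may refuse or cap this.
    if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                  len + (1 << 20), len + (4 << 20)))
        printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());

    if (!VirtualLock((LPVOID)data, len)) {
        printf("VirtualLock failed: %lu\n", GetLastError());
        return false;
    }
    return true;   // pages stay resident until VirtualUnlock() or process exit
}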

Limiting memory usage for a single process in OSX/Darwin

I am trying to modify some JNI code to limit the amount of memory that a process can consume. Here is the code that I am using to test setrlimit on Linux and OSX. On Linux it works as expected and buf is NULL.
This code sets the limit to 32 MB and then tries to malloc a 64 MB buffer; if the buffer is NULL, then setrlimit works.
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>

#define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

int main(void)
{
    struct rlimit current;
    int memLimit = 32 * 1024 * 1024;

    int result = getrlimit(RLIMIT_AS, &current);
    if (result != 0)
        errExit("Unable to get rlimit");

    current.rlim_cur = memLimit;
    current.rlim_max = memLimit;

    result = setrlimit(RLIMIT_AS, &current);
    if (result != 0)
        errExit("Unable to setrlimit");

    printf("Doing malloc \n");
    int memSize = 64 * 1024 * 1024;
    char *buf = malloc(memSize);
    if (buf == NULL) {
        printf("You're out of memory\n");
    } else {
        printf("Malloc successful\n");
    }

    free(buf);
    return 0;
}
On the Linux machine this is my result:
memtest]$ ./m200k
Doing malloc
You're out of memory
On OSX 10.8:
./m200k
Doing malloc
Malloc successful
My question is: if this does not work on OSX, is there a way to accomplish this task in the Darwin kernel? The man pages all seem to say it should work, but it does not appear to do so. I have seen that launchctl has some support for limiting memory, but my goal is to add this ability in code. I also tried using ulimit, but this did not work either, and I am pretty sure ulimit uses setrlimit to set its limits. Also, is there a signal I can catch when the setrlimit soft or hard limit is exceeded? I haven't been able to find one.
Bonus points if it can be accomplished on Windows also.
Thanks for any advice
Update
As pointed out, RLIMIT_AS is explicitly mentioned in the man page, but in the header RLIMIT_RSS is simply defined as an alias for RLIMIT_AS, so as far as the documentation is concerned RLIMIT_RSS and RLIMIT_AS are interchangeable on OSX.
/usr/include/sys/resource.h on osx 10.8
#define RLIMIT_RSS RLIMIT_AS /* source compatibility alias */
I tested trojanfoe's excellent suggestion to use RLIMIT_DATA, which is described here:
The RLIMIT_DATA limit specifies the maximum amount of bytes the process
data segment can occupy. The data segment for a process is the area in which
dynamic memory is located (that is, memory allocated by malloc() in C, or in C++,
with new()). If this limit is exceeded, calls to allocate new memory will fail.
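Presumably the test was the same program with only the resource changed, i.e. something like this sketch:

#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_DATA, &lim) != 0) { perror("getrlimit"); exit(1); }

    lim.rlim_cur = 32 * 1024 * 1024;
    lim.rlim_max = 32 * 1024 * 1024;
    if (setrlimit(RLIMIT_DATA, &lim) != 0) { perror("setrlimit"); exit(1); }

    char *buf = (char *) malloc(64 * 1024 * 1024);
    puts(buf == NULL ? "You're out of memory" : "Malloc successful");
    free(buf);
    return 0;
}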
The result was the same on Linux and OSX: the malloc was successful on both.
chinshaw#osx$ ./m200k
Doing malloc
Malloc successful
chinshaw#redhat ./m200k
Doing malloc
Malloc successful

What useful things can I do with Visual C++ Debug CRT allocation hooks except finding reproducible memory leaks?

The Visual C++ debug runtime library features so-called allocation hooks. They work this way: you define a callback and call _CrtSetAllocHook() to set that callback. Now, every time a memory allocation/deallocation/reallocation is done, the CRT calls that callback and passes a handful of parameters.
I successfully used an allocation hook to find a reproducible memory leak - basically the CRT reported that there was an unfreed block with allocation number N (N was the same on every program run) at program termination, so I wrote the following in my hook:
int MyAllocHook( int allocType, void* userData, size_t size, int blockType,
                 long requestNumber, const unsigned char* filename, int lineNumber)
{
    if( requestNumber == TheNumberReported ) {
        Sleep( 0 ); // a line to put a breakpoint on
    }
    return TRUE;
}
Since the leak was reported with the very same allocation number every time, I could just put a breakpoint inside the if-statement, wait until it was hit, and then inspect the call stack.
What other useful things can I do using allocation hooks?
You could also use it to find unreproducible memory leaks (a sketch follows the list below):
Make a data structure where you map the allocated pointer to additional information
In the allocation hook you could query the current call stack (StackWalk function) and store the call stack in the data structure
In the de-allocation hook, remove the call stack information for that allocation
At the end of your application, loop over the data structure and report all call stacks. These are the places where memory was allocated but not freed.
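Here is a sketch of what that could look like with the debug CRT on Windows. The map, the re-entrancy flag and the use of CaptureStackBackTrace (instead of the heavier StackWalk mentioned above) are illustrative choices, not the only way to do it.

#include <windows.h>
#include <crtdbg.h>
#include <map>
#include <vector>

struct AllocRecord {
    size_t size;
    std::vector<void*> stack;
};

static std::map<long, AllocRecord> g_live;   // requestNumber -> record
static bool g_inHook;                        // guards against re-entrancy (not thread-safe)

static int __cdecl MyTrackingHook(int allocType, void* userData, size_t size,
                                  int blockType, long requestNumber,
                                  const unsigned char* filename, int lineNumber)
{
    // Skip the CRT's own blocks and our own bookkeeping, otherwise the hook
    // recurses through the map's allocations.
    if (g_inHook || blockType == _CRT_BLOCK)
        return TRUE;
    g_inHook = true;

    if (allocType == _HOOK_ALLOC) {
        AllocRecord rec;
        rec.size = size;
        rec.stack.resize(62);
        USHORT frames = CaptureStackBackTrace(1, 62, rec.stack.data(), NULL);
        rec.stack.resize(frames);
        g_live[requestNumber] = rec;
    } else if (allocType == _HOOK_FREE) {
        // requestNumber is not supplied on free (see the next answer for how
        // to dig it out of the block header); a real implementation erases
        // the matching entry from g_live here.
    }

    g_inHook = false;
    return TRUE;
}

// Install with _CrtSetAllocHook(MyTrackingHook); at exit, walk g_live and
// symbolize each recorded stack (e.g. with the DbgHelp symbol APIs).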
The value "requestNumber" is not passed on to the function when deallocating (MS VS 2008). Without this number you cannot keep track of your allocation. However, you can peek into the heap header and extract that value from there:
Note: This is compiler-dependent and may change without notice/warning by the compiler.
// This struct is a copy of the heap header used by MS VS 2008.
// This header is prepended to each allocated memory block in debug mode.
struct MsVS_CrtMemBlockHeader {
    MsVS_CrtMemBlockHeader * _next;
    MsVS_CrtMemBlockHeader * _prev;
    char * _szFilename;
    int _nLine;
    int _nDataSize;
    int _nBlockUse;
    long _lRequest;
    char _gap[4];
};

int MyAllocHook(..) { // same signature as in the question
    if (allocType == _HOOK_FREE) {
        // requestNumber isn't passed on to the hook on free.
        // However, the heap header stores this value.
        size_t headerSize = sizeof(MsVS_CrtMemBlockHeader);
        MsVS_CrtMemBlockHeader* pHead;
        size_t ptr = (size_t) userData - headerSize;
        pHead = (MsVS_CrtMemBlockHeader*) (ptr);
        long requestNumber = pHead->_lRequest;
        // Do what you like to keep track of this allocation.
    }
}
You could keep a record of every allocation request and then remove it once the deallocation is invoked. This could help you track down memory leak problems that are much harder to pin down than this one.
Just the first idea that comes to my mind...
