When creating a CUDA graph memory allocation node, what's the default access on the allocating device?

When creating a (template of a) CUDA execution graph, we can add a memory allocation node. It takes a CUDA_MEM_ALLOC_NODE_PARAMS structure. In addition to specifying the device on which you allocate, that structure can take "access descriptors" for multiple locations/devices.
My question: What happens if you don't provide an access descriptor for the device on which the memory is allocated? Does that device have read and write access to the allocated memory by default, or must we also pass an access descriptor for it?

Related

Could someone help me understand VkPhysicalDeviceMemoryProperties?

I'm trying to figure it out, but I'm getting a little stuck.
The way the types and heaps are related is simple, if a bit strange. (why not just give VkMemoryHeap a VkMemoryType member?)
I think I understand what all the VkMemoryPropertyFlags mean, they seem fairly straightforward.
But what's with the VkMemoryHeap.flags member? It apparently has only one non-zero valid value, VkMemoryHeapFlagBits.VK_MEMORY_HEAP_DEVICE_LOCAL_BIT. That wouldn't be too odd on its own, but there's also a VkMemoryPropertyFlagBits.VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT that could be present on the memory type of the heap.
What does the VkMemoryHeap.flags member mean, and how does it relate to the VkMemoryType.propertyFlags member?
Vulkan recognizes two distinct concepts when it comes to memory. There are the actual physical pieces of RAM that the device can talk to. Then there are ways to allocate memory from one of those pools of RAM.
A heap represents a specific piece of RAM. VkMemoryHeap is the object that describes one of the available heaps of RAM that the device can talk to. There really aren't that many things that define a particular heap. Just two: the number of bytes of that RAM's storage and the storage's location relative to the Vulkan device (local vs. non-local).
A memory type is a particular means of allocating memory from a specific heap. VkMemoryType is the object that describes a particular way of allocating memory. And there are a lot more descriptive flags for how you can allocate memory from a heap.
For a more concrete example, consider a standard PC setup with a discrete GPU. The device has its own local RAM, but the discrete GPU can also access CPU memory. So a Vulkan device will have two heaps: one of them will be local, the other non-local.
However, there will usually be more than two memory types. You usually have one memory type that represents local memory, which does not have the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT set. That means you can't map the memory; you can only access it via transfer operations from some other memory type (or from rendering operations or whatever).
But you will often have two memory types that both use the same non-local heap. They will both be VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, thus allowing mapping. However, one of them will likely have the VK_MEMORY_PROPERTY_HOST_CACHED_BIT flag set, while the other will be VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. This allows you to choose whether you want cached CPU access (thus requiring an explicit flush of ranges of modified memory) or uncached CPU access.
But while they are two separate memory types, they both allocate from the same heap, which is why VkMemoryType has an index (heapIndex) referring to the heap whose memory it allocates from.
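As a rough illustration of the heap/type relationship described above, here is a minimal sketch of the usual "pick a memory type" loop. To keep it compilable without the Vulkan SDK, it uses simplified stand-in structs (MemHeap, MemType, MemProps are hypothetical names) whose fields mirror VkMemoryHeap, VkMemoryType and VkPhysicalDeviceMemoryProperties. The heap sizes and the three-type layout in example_discrete_gpu are made-up values that follow the discrete-GPU description, not data from a real driver.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values copied from vulkan_core.h; the short names are stand-ins. */
enum {
    HEAP_DEVICE_LOCAL_BIT  = 0x1, /* VK_MEMORY_HEAP_DEVICE_LOCAL_BIT */
    PROP_DEVICE_LOCAL_BIT  = 0x1, /* VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT */
    PROP_HOST_VISIBLE_BIT  = 0x2, /* VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT */
    PROP_HOST_COHERENT_BIT = 0x4, /* VK_MEMORY_PROPERTY_HOST_COHERENT_BIT */
    PROP_HOST_CACHED_BIT   = 0x8  /* VK_MEMORY_PROPERTY_HOST_CACHED_BIT */
};

#define MAX_TYPES 32 /* VK_MAX_MEMORY_TYPES */
#define MAX_HEAPS 16 /* VK_MAX_MEMORY_HEAPS */

typedef struct { uint32_t propertyFlags; uint32_t heapIndex; } MemType;
typedef struct { uint64_t size; uint32_t flags; } MemHeap;
typedef struct {
    uint32_t memoryTypeCount;
    MemType  memoryTypes[MAX_TYPES];
    uint32_t memoryHeapCount;
    MemHeap  memoryHeaps[MAX_HEAPS];
} MemProps;

/* Hypothetical layout: one local heap, one non-local heap, three types. */
MemProps example_discrete_gpu(void)
{
    MemProps p = {0};
    p.memoryHeapCount = 2;
    p.memoryHeaps[0] = (MemHeap){ 8ull << 30, HEAP_DEVICE_LOCAL_BIT };
    p.memoryHeaps[1] = (MemHeap){ 16ull << 30, 0 };
    p.memoryTypeCount = 3;
    p.memoryTypes[0] = (MemType){ PROP_DEVICE_LOCAL_BIT, 0 };
    p.memoryTypes[1] = (MemType){ PROP_HOST_VISIBLE_BIT | PROP_HOST_COHERENT_BIT, 1 };
    p.memoryTypes[2] = (MemType){ PROP_HOST_VISIBLE_BIT | PROP_HOST_CACHED_BIT, 1 };
    return p;
}

/* Returns the index of the first memory type that is permitted by type_bits
   (as in VkMemoryRequirements.memoryTypeBits) and has every flag in wanted,
   or -1 if no type matches. */
int find_memory_type(const MemProps *p, uint32_t type_bits, uint32_t wanted)
{
    assert(p->memoryTypeCount <= MAX_TYPES);
    for (uint32_t i = 0; i < p->memoryTypeCount; ++i) {
        int allowed = (int)((type_bits >> i) & 1u);
        if (allowed && (p->memoryTypes[i].propertyFlags & wanted) == wanted)
            return (int)i;
    }
    return -1;
}
```

With real Vulkan headers the same loop would run over the structure filled in by vkGetPhysicalDeviceMemoryProperties, and the returned index would go into VkMemoryAllocateInfo.memoryTypeIndex.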
Only thing I'm not getting is how the two DEVICE_LOCAL flags interact.
Did you look at the specification? It's not exactly hiding how this works:
if propertyFlags has the VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT bit set, memory allocated with this type is the most efficient for device access. This property will only be set for memory types belonging to heaps with the VK_MEMORY_HEAP_DEVICE_LOCAL_BIT set.
Is it saying that if the memory is local then all types corresponding to that memory are local, or that they can be local?
You seem to be trying to impose the wrong meaning on these things. Just look at what the specification says and take it at face value.
PROPERTY_DEVICE_LOCAL denotes a memory type which will achieve the best device access performance. The only connection between this and HEAP_DEVICE_LOCAL is that memory types with PROPERTY_DEVICE_LOCAL will only be associated with memory heaps that have HEAP_DEVICE_LOCAL set.
That's the only relevant meaning here.
If you want an example of when a memory heap would be device local, yet have memory types that aren't, consider a GPU that has no memory of its own. There's only one heap, which is therefore HEAP_DEVICE_LOCAL.
However, allocating memory from that heap in a way that makes it host-visible may decrease the performance of device access to that memory. Therefore, for such hardware, the host-visible memory types for the same heap will not use PROPERTY_DEVICE_LOCAL.
Then again, other hardware doesn't lose performance from making memory host-visible. So they only have one memory type, which has all of the available properties. For Intel, their on-chip GPUs apparently have access to some level of the CPU's caches.

Difference between kmalloc and kmem_cache_alloc

What is the difference between kmem_cache_alloc and kmalloc() in kernel memory allocation? Which one is used when?
kmalloc: allocates a physically contiguous region of memory. But keep in mind that allocating and freeing memory is a lot of work.
kmem_cache_alloc: here, the cache keeps a number of objects of a predefined size pre-allocated. Say you have a struct that you know you will need very frequently. Instead of allocating it with kmalloc each time you need it, you keep multiple copies of it allocated, and when you want one, the call returns the address of a block that is already allocated (which saves a lot of time). Similarly, when you free it, it isn't actually given back; it returns to the allocated pool, so that if some code asks for one again, the address of the already allocated struct can be returned.
kmalloc: uses the generic slab caches available to any kernel code, so your module will share a slab cache with other components in the kernel.
kmem_cache_alloc: allocates objects from a dedicated slab cache created by kmem_cache_create. If you specifically want slab cache management dedicated to your module, and for one specific type of object, use kmem_cache_create followed by kmem_cache_alloc. USB/SCSI drivers use this. kmem_cache_create takes the size of the object you want to create a slab cache for, a name which appears in /proc/slabinfo, and flags that govern the behavior of your slab cache.
Ref: https://www.mail-archive.com/kernelnewbies#nl.linux.org/msg13191.html & LDD
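The "freed objects go back to a pool and get handed out again" idea can be sketched in plain userspace C. This is only an analogy, not the kernel API: obj_cache, cache_init, cache_alloc, cache_free and cache_destroy are hypothetical names, and malloc stands in for the general-purpose allocator that a dedicated cache falls back to when its pool is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A released object is reused as a link in the free list. */
typedef struct free_node { struct free_node *next; } free_node;

typedef struct {
    size_t     obj_size;  /* size of every object served by this cache */
    free_node *free_list; /* released objects waiting to be reused */
    size_t     reused;    /* allocations that were served from free_list */
} obj_cache;

void cache_init(obj_cache *c, size_t obj_size)
{
    /* Each object must be able to hold a free_node while it is pooled. */
    c->obj_size = obj_size < sizeof(free_node) ? sizeof(free_node) : obj_size;
    c->free_list = NULL;
    c->reused = 0;
}

void *cache_alloc(obj_cache *c)
{
    if (c->free_list != NULL) {
        /* Fast path: hand back an object that was released earlier. */
        free_node *n = c->free_list;
        c->free_list = n->next;
        c->reused++;
        return n;
    }
    /* Slow path: the pool is empty, ask the general allocator. */
    return malloc(c->obj_size);
}

void cache_free(obj_cache *c, void *obj)
{
    assert(obj != NULL);
    /* The object is not returned to the system, only pushed onto the pool. */
    free_node *n = obj;
    n->next = c->free_list;
    c->free_list = n;
}

void cache_destroy(obj_cache *c)
{
    /* Only now is the pooled memory actually released. */
    while (c->free_list != NULL) {
        free_node *n = c->free_list;
        c->free_list = n->next;
        free(n);
    }
}
```

Allocating, freeing and allocating again returns the same address the second time, which is the time-saving behavior described above. A real slab cache additionally groups objects into slabs of whole pages, which this sketch does not attempt.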

QueryWorkingSetEx returning invalid pages when applied to shared memory

I'm creating a block of shared memory using CreateFileMapping and MapViewOfFile, thus obtaining a pointer. I then apply QueryWorkingSetEx to it. The problem is that I keep getting invalid pages in the PSAPI_WORKING_SET_EX_INFORMATION return structure. I'm on a NUMA architecture, but the same thing happens on non-NUMA machines.
If I try the exact same procedure on memory allocated with malloc, I get valid results. Is it possible that QueryWorkingSetEx does not support shared memory pointers?
After talking with Microsoft's support, I was given the solution for this: since QueryWorkingSetEx is being called immediately after MapViewOfFile, the memory hasn't been touched yet, so the pages are not yet backed by any physical memory.
The solution is simply to do a read loop over the memory before QueryWorkingSetEx is invoked. This forces the memory manager to back the pages with physical memory.
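The read loop could look roughly like the portable sketch below. touch_pages is a hypothetical helper name, and the page size is passed in as a parameter; on Windows it would presumably come from GetSystemInfo (the dwPageSize field). The volatile qualifier is there so the compiler cannot drop the otherwise unused reads.

```c
#include <assert.h>
#include <stddef.h>

/* Reads one byte per page in [base, base + len) so that each page gets
   backed by physical memory before it is queried.
   Returns the number of page-sized steps taken. */
size_t touch_pages(const void *base, size_t len, size_t page_size)
{
    assert(page_size > 0);
    const volatile unsigned char *p = base;
    size_t steps = 0;

    for (size_t off = 0; off < len; off += page_size) {
        (void)p[off]; /* volatile read, performed even though unused */
        steps++;
    }
    /* If base is not page-aligned, the final page may have been skipped
       by the stride above, so read the last byte as well. */
    if (len > 0)
        (void)p[len - 1];

    return steps;
}
```

The view returned by MapViewOfFile would be passed as base, together with the mapped size, before calling QueryWorkingSetEx.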

How to pin a shared memory segment into physical memory

I use boost::interprocess::managed_shared_memory to load a data structure in shared memory. I need to pin the shared memory segment into physical memory (for example similar to system call mlock for mapped files).
On Linux, sooner or later my data structure gets swapped out of physical memory. In my case this incurs a prohibitive cost for the next process that accesses the structure after it has been swapped out.
Is there any way to pin shared memory into physical memory? I am interested in any solution, even if it means that I cannot use boost::interprocess.
Using basic_managed_xsi_shared_memory (apparently available since Boost 1.46), you can access the underlying shmid (from the get_shmid member), which lets you control the segment with shmctl. With shmctl you can prevent the shared memory pages from being swapped out by applying the SHM_LOCK command to the shmid.
Other types of locking (which you refer to as 'pinning'), such as locking memory-mapped files into memory, can be realized by passing the values returned by mapped_region's get_address and get_size member functions to mlock.
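At the system-call level, the two approaches might look like the following sketch for Linux. The helper names (pin_sysv_segment, pin_region, unpin_region, map_shared) are hypothetical. map_shared only creates an anonymous shared mapping as a stand-in for the address and size that mapped_region would report. Both locking calls are subject to RLIMIT_MEMLOCK for unprivileged processes, so the helpers return the errno value instead of assuming success.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

/* Locks a System V segment (e.g. the id from get_shmid) into RAM.
   SHM_LOCK is Linux-specific. Returns 0 or the errno from shmctl. */
int pin_sysv_segment(int shmid)
{
    return shmctl(shmid, SHM_LOCK, NULL) == 0 ? 0 : errno;
}

/* Locks [addr, addr + len) into RAM, e.g. the values from get_address
   and get_size. Returns 0 or the errno from mlock. */
int pin_region(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    return mlock(addr, len) == 0 ? 0 : errno;
}

/* Undoes pin_region. Returns 0 or the errno from munlock. */
int unpin_region(void *addr, size_t len)
{
    return munlock(addr, len) == 0 ? 0 : errno;
}

/* Stand-in for a mapped shared region: an anonymous MAP_SHARED mapping.
   Returns NULL on failure. */
void *map_shared(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A caller would check the return value and, on ENOMEM or EPERM, either raise the memlock limit or fall back to running unpinned.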

Is there any way to convert unshared heap memory to shared memory? Primary target *nix

I was wondering whether there is any reasonably portable way to take existing, unshared heap memory and to convert it into shared memory. The usage case is a block of memory which is too large for me to want to copy it unnecessarily (i.e. into a new shared memory segment) and which is allocated by routines out of my control. The primary target is *nix/POSIX, but I would also be interested to know if it can be done on Windows.
Many *nixes have Plan 9's procfs, which allows you to read a process's memory by opening /proc/{pid}/mem:
http://en.wikipedia.org/wiki/Procfs
You tell the other process your pid, the size and the base address, and it can simply read the data (or mmap the region into its own address space).
EDIT: Apparently you can't open /proc/{pid}/mem without a prior ptrace(), so this is basically worthless.
Under most *nixes, ptrace(2) allows you to attach to a process and read its memory.
The ptrace method does not work on OS X; you need more magic there:
http://www.voidzone.org/readprocessmemory-and-writeprocessmemory-on-os-x/
What you need under Windows is the function ReadProcessMemory.
Googling "What is ReadProcessMemory for $OSNAME" seems to return comprehensive result sets.
You can try using mmap() with the MAP_FIXED flag and the address of your unshared memory (allocated from the heap). However, if you provide mmap() with your own pointer, it must be aligned and sized to the memory page size (which can be queried with sysconf(_SC_PAGESIZE)), since mappings are only possible for whole pages.
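The alignment constraint can be made concrete with a small sketch that computes which part of a heap buffer consists of whole pages, since only that part could be handed to mmap() with MAP_FIXED. The names page_range, inner_pages and system_page_size are hypothetical. Note also that, per the mmap documentation, MAP_FIXED discards whatever was previously mapped at that address, so the existing contents would have to be saved and copied back separately; the sketch covers only the address arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

typedef struct {
    uintptr_t start; /* first page-aligned address inside the buffer */
    size_t    len;   /* number of bytes in whole pages, 0 if none fit */
} page_range;

/* Largest run of whole pages lying inside [addr, addr + len).
   page_size must be a power of two. */
page_range inner_pages(uintptr_t addr, size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    uintptr_t mask  = ~(uintptr_t)(page_size - 1);
    uintptr_t first = (addr + page_size - 1) & mask; /* round start up */
    uintptr_t last  = (addr + len) & mask;           /* round end down */
    page_range r = { first, last > first ? (size_t)(last - first) : 0 };
    return r;
}

/* Page size as reported by the system. */
size_t system_page_size(void)
{
    long n = sysconf(_SC_PAGESIZE);
    assert(n > 0);
    return (size_t)n;
}
```

For a buffer at 0x1003 of length 0x3000 with 4 KiB pages, only the range 0x2000 to 0x4000 qualifies; the partial pages at either end would still have to be copied.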
