Can a GPU use swap space when its RAM is full? - memory-management

I'm doing some GPU calculation using OpenCL, where I need to create a buffer of about 5 GB. My laptop has an integrated GPU with 1.5 GB of RAM. I tried to run the code and it gave the wrong result, so I guess it's because the GPU's RAM is full. My question is whether there is some "swap space" (or virtual memory) that the GPU can utilize when its RAM is full. I know the CPU has this mechanism, but I'm not sure about the GPU.

No, it cannot (at least on most GPUs), because the GPU generally uses its own memory (the RAM on your graphics card).
Also, OpenCL kernels don't do any malloc inside the kernel; you allocate buffers on the host with clCreateBuffer.
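As a sanity check (a minimal sketch, assuming you already have a valid cl_context from your existing host code), always inspect the error code clCreateBuffer hands back; an allocation larger than the device allows fails with an error instead of silently spilling to swap:

#include <CL/cl.h>
#include <stdio.h>

/* ctx is assumed to be a cl_context created earlier in your host code. */
cl_mem create_big_buffer(cl_context ctx, size_t bytes)
{
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    if (err != CL_SUCCESS) {
        /* Typical failures: CL_INVALID_BUFFER_SIZE (request exceeds
           CL_DEVICE_MAX_MEM_ALLOC_SIZE) or CL_MEM_OBJECT_ALLOCATION_FAILURE. */
        fprintf(stderr, "clCreateBuffer failed: %d\n", (int)err);
        return NULL;
    }
    return buf;
}

If the 5 GB request fails here, the wrong result most likely comes from using a buffer that was never actually created, not from any swapping.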

That would depend on the GPU and whether it has an MMU and DMA access to the host memory.
A GPU with an MMU can virtualize GPU and host memory so that they appear as a single address space, with the physical host memory accesses handled by DMA transfers. I would imagine that if your GPU had that capability it would already be used, in which case your problem is most probably elsewhere.
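As a quick diagnostic (a minimal sketch; dev is assumed to be the cl_device_id your context was built on), you can query how much memory the device reports and the largest single allocation it will accept:

#include <CL/cl.h>
#include <stdio.h>

void print_mem_limits(cl_device_id dev)
{
    cl_ulong global_mem = 0, max_alloc = 0;
    clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    printf("global memory: %llu bytes, max single allocation: %llu bytes\n",
           (unsigned long long)global_mem, (unsigned long long)max_alloc);
}

On a 1.5 GB integrated GPU, CL_DEVICE_MAX_MEM_ALLOC_SIZE will typically be far below 5 GB, which by itself would explain the failing buffer.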

Related

Allocate more than 2 GB of DMA common buffer

I am developing a driver for PCI Express in a Windows environment.
I use Windows 7 and Windows 10, and the hardware is an i7-7700K with 16 GB of RAM.
There is no problem allocating and using a buffer of up to 2 GB so far.
But more than 2 GB cannot be allocated.
Here is the code snippet that succeeds in allocating a 2 GB DMA common buffer.
DEVICE_DESCRIPTION dd;
RtlZeroMemory(&dd, sizeof(dd));
dd.Version = DEVICE_DESCRIPTION_VERSION;
dd.InterfaceType = InterfaceTypeUndefined;
dd.MaximumLength = 0x200000; // 2MB DMA transfer size at a time
dd.Dma32BitAddresses = FALSE;
dd.Dma64BitAddresses = TRUE;
dd.Master = TRUE;
pdx->AdapterObject = IoGetDmaAdapter(pdx->Pdo, &dd, &nMapRegisters);
pdx->vaCommonBuffer = (*pdx->AdapterObject->DmaOperations->AllocateCommonBuffer)
(pdx->AdapterObject, 0x80000000, &pdx->paCommonBuffer, FALSE);
What is the size limit for the DMA common buffer allocation, and why?
Changing the length from 0x80000000 (2 GB) to 0xC0000000 (3 GB) above causes the buffer allocation to fail.
How can I allocate a DMA common buffer of 4 GB, or even more than 4 GB, using AllocateCommonBuffer()?
I believe that DMAv2 (which is the best Windows 7 supports) cannot do more than 4GB in a single allocation, so there is a hard limit.
But even before you hit the hard limit, you're going to start seeing nondeterministic allocation failures, because AllocateCommonBuffer gives you physically contiguous pages.
As a thought experiment: all it takes is 8 unmovable pages placed at pathologically-inconvenient locations before you can't find 2GB of contiguous pages out of 16GB of RAM. And a real system will have a lot more than 8 unmovable pages, although hopefully they aren't placed at such unfortunate intervals.
Compounding matters, DMAv2 refuses to allocate memory that straddles a 4GB boundary (for historical reasons). So of the 3,670,017 different pages that this allocation could start at, 50% of them are not even considered. At 3GB, 75% of possible allocations are not considered.
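(For the curious, the 3,670,017 figure is just page arithmetic, assuming 4 KB pages: 16 GB of RAM is 4,194,304 pages, a 2 GB allocation needs 524,288 contiguous pages, so the allocation could start at any of 4,194,304 - 524,288 + 1 = 3,670,017 pages.)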
We've put a lot of work into the DMA subsystem over the years: DMAv3 is a more powerful API, with fewer weird quirks. Its implementation in Windows 10 and later has some fragmentation-resistance features, and the kernel's memory manager is better at moving pages. That doesn't eliminate the fundamental problem of fragmentation; but it does make it statistically less likely.
Very recent versions of the OS can actually take advantage of the IOMMU (if available and enabled) to further mitigate the problem: you don't need the physical pages to be contiguous anymore, if we can just use the IOMMU to make them seem contiguous to the device. (An MMU is exactly why usermode apps don't need to worry about physical memory fragmentation when they allocate massive buffers via malloc).
At the end of the day, though, you simply can't assume that the system has any 2 contiguous pages of memory for your device, especially if you need your device to operate on a wide variety of systems. I've seen production systems with oodles of RAM routinely fail to allocate 64KB of common buffer, which is "merely" 16 contiguous pages. You will need to either:
Fall back to using many smaller allocations. They won't be contiguous with each other, but you'll have much more success allocating them (see the sketch after this list).
Fall back to a smaller buffer. For example, on my home turf of networking, a NIC can use a variety of buffer sizes, and the only visible effect of a smaller buffer might just be degraded throughput.
Ask the user to reboot the device (which is a very blunt way to defragment memory!).
Try to allocate memory once, early in boot, (before memory is fragmented), and never let go of it.
Update the OS, update to DMAv3 API, and ensure you have an IOMMU.
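As a rough sketch of the first option (hedged: the chunk size, chunk count, and helper structure below are illustrative and not part of any Windows API; only AllocateCommonBuffer itself is), a driver can build a pool of small common-buffer chunks and simply stop at the first failure:

/* Assumes the usual WDM headers (wdm.h / ntddk.h) and a valid PDMA_ADAPTER. */
#define CHUNK_SIZE  (4UL * 1024 * 1024)   /* 4 MB per chunk (illustrative) */
#define MAX_CHUNKS  512                   /* up to 2 GB in total (illustrative) */

typedef struct _COMMON_BUFFER_CHUNK {
    PVOID            Va;   /* kernel virtual address of the chunk */
    PHYSICAL_ADDRESS Pa;   /* device-visible logical address of the chunk */
} COMMON_BUFFER_CHUNK;

static ULONG AllocateCommonBufferPool(PDMA_ADAPTER Adapter,
                                      COMMON_BUFFER_CHUNK Chunks[MAX_CHUNKS])
{
    ULONG i;
    for (i = 0; i < MAX_CHUNKS; i++) {
        Chunks[i].Va = (*Adapter->DmaOperations->AllocateCommonBuffer)(
            Adapter, CHUNK_SIZE, &Chunks[i].Pa, FALSE);
        if (Chunks[i].Va == NULL) {
            break;          /* stop at the first failure and use what we have */
        }
    }
    return i;               /* number of chunks actually allocated */
}

The device then has to be programmed with a list of chunk addresses (scatter/gather style) rather than a single base address, but each 4 MB request is vastly more likely to succeed than one 2 GB request.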

Does CUDA mapped memory take up GPU RAM?

For example, if I have a GPU with 2 GB of RAM and my app allocates a large array, say 1 GB, as mapped memory (page-locked host memory that is mapped into the GPU address space, allocated with cudaHostAlloc()), will the amount of available GPU memory be reduced by that 1 GB of mapped memory, or will I still have (close to) 2 GB available, as I had before the allocation and use?
Mapping host memory so that it appears in the GPU address space does not consume memory from the GPU on-board memory.
You can verify this in a number of ways, for example by calling cudaMemGetInfo before and after the allocation.
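A minimal sketch of that check (assumptions: a 64-bit build with a CUDA-capable device; the 1 GB size is arbitrary):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    size_t free_before = 0, free_after = 0, total = 0;
    void  *host_ptr = NULL;

    /* On older setups, mapped pinned memory requires this flag before the
       CUDA context is created; with unified addressing it is effectively
       the default. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaMemGetInfo(&free_before, &total);

    /* 1 GB of page-locked host memory, mapped into the GPU address space. */
    cudaHostAlloc(&host_ptr, (size_t)1 << 30, cudaHostAllocMapped);

    cudaMemGetInfo(&free_after, &total);
    printf("free GPU memory before: %zu bytes, after: %zu bytes\n",
           free_before, free_after);

    cudaFreeHost(host_ptr);
    return 0;
}

The two numbers should be almost identical: the mapped allocation lives in host RAM, and only a small amount of GPU memory (if any) is used for bookkeeping.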

How to use the memory of the Xilinx-FPGA Virtex5/7 as a memory mapped into the x86-CPU's address space?

Is it possible to use the memory of a Xilinx FPGA (Virtex-5/7) as memory mapped into the virtual and/or physical address space of an Intel x86_64 CPU, and how can this be done?
Ideally, I need a single unified address space with direct memory access (DMA) from the CPU to the FPGA's memory (just like ordinary access to CPU RAM).
CPU: x86_64 Intel Core i7
OS: Linux kernel 2.6
Interface connection: PCI-Express 2.0 8x
You can, in theory.
You'll need to write a fair amount of VHDL/Verilog to take the PCIe packets and respond to them appropriately, driving the address, data, and control lines of the internal memory blocks ("BlockRAMs") to perform the reads and writes. I imagine that treating all the BlockRAM as one massive memory is likely to cause routing congestion problems, though!
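On the CPU side, once the FPGA exposes its BlockRAM behind a PCIe BAR, Linux lets you map that BAR straight into a user process. A minimal sketch (the device address 0000:01:00.0, the BAR number, and the 1 MB size are assumptions; check lspci on your system):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical PCI device path and BAR size. */
    const char  *bar_path = "/sys/bus/pci/devices/0000:01:00.0/resource0";
    const size_t bar_size = 1 << 20;

    int fd = open(bar_path, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *mem = (volatile uint32_t *)mmap(
        NULL, bar_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if ((void *)mem == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    mem[0] = 0xDEADBEEF;               /* a plain store now travels over PCIe */
    printf("readback: 0x%08X\n", (unsigned)mem[0]);

    munmap((void *)mem, bar_size);
    close(fd);
    return 0;
}

This gives memory-mapped access from the CPU; true DMA in the other direction (the FPGA mastering the bus into host RAM) still needs a DMA engine in the FPGA and a kernel driver to hand it physical addresses.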

Kernel memory address space

I've read that, on a 32-bit system with 4 GB of system memory, 2 GB is allocated to user mode and 2 GB to kernel mode. But if I had a system with 512 MB of memory, would it be partitioned as 256 MB of user and 256 MB of kernel address space?
You are confusing physical and virtual memory. The 2 GB given to user/kernel is virtual memory. It is even more correct to say that it is not really allocated at all; it constitutes an address space. Initially this space is not bound to physical memory. When the application actually needs memory (the first time is at start-up), physical memory is allocated and some addresses from the address space are mapped to it. When memory is allocated but not used for long enough, or the PC is running out of physical memory, the data can be dumped to the swap file and stay there until requested. This mapping is transparent to the application, which has no idea where its data currently is: in RAM or on the HDD. So the address space is always split the same way.
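A minimal sketch of this distinction on Windows (the 1 GB and 64 KB figures are arbitrary): VirtualAlloc can reserve a large range of addresses without backing it with any physical memory or pagefile space, and commit only the part that is actually used.

#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Reserve 1 GB of address space: no physical memory or pagefile is
       consumed yet, this is just a range of addresses. */
    void *p = VirtualAlloc(NULL, (SIZE_T)1 << 30, MEM_RESERVE, PAGE_NOACCESS);

    /* Commit (i.e. back with memory/pagefile) only the first 64 KB. */
    VirtualAlloc(p, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);
    memset(p, 0, 64 * 1024);            /* touching the committed part is legal */

    printf("reserved 1 GB of address space at %p\n", p);
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}

On 32-bit Windows this reservation already takes half of the 2 GB user address space while consuming almost no physical memory, which is exactly the distinction described above.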
This is not about memory (physical or virtual), but about address space.
You can plug 16GB of physical memory into your computer and make a 100GB swapfile, but 32-bit (non-enterprise) Windows will still only see 4GB (and subtract 0.75 GB for GPU memory and such). Via PAE, it could use more, but non-enterprise versions won't do that.
On top of the actual amount of memory, there is address space, which is limited to 4GB as well. Basically it is no more and no less than the collection of "numbers" (which, in this case, are addresses) that can be represented by a 32 bit number.
Since the kernel will need memory too, there is some arbitrary line drawn, which happens to be at the 2GB boundary for 32bit Windows, but can be configured differently, too.
It has nothing to do with the amount of memory in your computer (virtual or physical); it is a limiting factor on how much memory you can use within a single program instance. It is not, however, a limiting factor on the memory that several programs together could use.
As far as I can tell, what you are referring to are limits on how much memory can be allocated. That is very different from how much memory the OS actually allocates at runtime.

CUDA Memory Allocation accessible for both host and device

I'm trying to figure out a way to allocate a block of memory that is accessible by both the host (CPU) and the device (GPU). Other than using the cudaHostAlloc() function to allocate page-locked memory that is accessible to both the CPU and GPU, are there other ways of allocating such blocks of memory? Thanks in advance for your comments.
The only way for the host and the device to "share" memory is using the newer zero-copy functionality. This is available on the GT200 architecture cards and some newer laptop cards. This memory must be, as you note, allocated with cudaHostAlloc so that it is page locked. There is no alternative, and even this functionality is not available on older CUDA capable cards.
If you're just looking for an easy (possibly non-performant) way to manage host to device transfers, check out the Thrust library. It has a vector class that lets you allocate memory on the device, but read and write to it from host code as if it were on the host.
Another alternative is to write your own wrapper that manages the transfers for you.
There is no way to allocate a buffer that is accessible by both the GPU and the CPU unless you use cudaHostAlloc(). This is because not only must you allocate the pinned memory on the CPU (which you could do outside of CUDA), but also you must map the memory into the GPU's (or more specifically, the context's) virtual memory.
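A minimal sketch of that zero-copy mapping on the host side (assumptions: a mapped-memory-capable device; the kernel that would consume the device pointer is not shown):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *h_buf = NULL, *d_buf = NULL;
    const size_t bytes = 1 << 20;

    /* On older devices the context must be created with map-host support. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Page-locked host allocation that is also mapped into the GPU's
       address space. */
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocMapped);

    /* Device-side alias of the same physical host memory. */
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);

    /* d_buf can now be passed to a kernel launch (not shown); GPU reads
       and writes go over the PCIe bus directly to the host allocation. */
    printf("host %p <-> device %p\n", h_buf, d_buf);

    cudaFreeHost(h_buf);
    return 0;
}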
It's true that on a discrete GPU zero-copy does incur a bus transfer. However if your access is nicely coalesced and you only consume the data once it can still be efficient, since the alternative is to transfer the data to the device and then read it into the multiprocessors in two stages.
No, there is no "automatic" way of uploading buffers to GPU memory.

Resources