How to completely free the memory of Milvus?

Milvus uses 3 GB of memory after loading the vector data. After calling collection.release(), the memory is not fully released; about 1.8 GB is still in use. How can I release it all?
I am using Milvus 2.0.2.

Related

D3D11_USAGE_STAGING, what kind of GPU/CPU memory is used?

I read the D3D11 Usage page and, coming from a CUDA background, I'm wondering what kind of memory a texture marked as D3D11_USAGE_STAGING would be stored in.
I suppose in CUDA it would be pinned page-locked zero-copy memory. I measured the transfer time from an ID3D11Texture2D with D3D11_USAGE_STAGING to a host buffer allocated with malloc, and it took almost 7 milliseconds (quite a lot for streaming/gaming); I assumed this was the time required to copy from GPU global memory to that memory area.
Are any of my suppositions correct? What does D3D11_USAGE_STAGING use as GPU memory?
The primary use for D3D11_USAGE_STAGING is as a way to load data into other D3D11_USAGE_DEFAULT pool resources. Another common usage is for 'readback' of a render target to CPU-accessible memory. You can use CopySubresourceRegion to move data between DEFAULT and STAGING resources (discrete hardware often uses Direct Memory Access to handle the moving of data between system memory and VRAM).
This is a generalization because it depends on the architecture and driver choices, but in short:
D3D11_USAGE_STAGING means put it in system memory, and the GPU can't access it.
D3D11_USAGE_DEFAULT means put it in VRAM, and the CPU can't access it. Putting data into it requires copying from a STAGING resource. You can think of UpdateSubresource as a convenience function that creates a STAGING resource, copies the data to it, copies from STAGING to DEFAULT, then releases the STAGING copy.
There is an optional feature in DirectX 11.2 or later that drivers can implement to allow CPU access even to D3D11_USAGE_DEFAULT pool resources. This depends on how their memory system is set up (i.e. in Unified Memory Architectures, system RAM and VRAM are the same thing).
D3D11_USAGE_IMMUTABLE is basically the same as D3D11_USAGE_DEFAULT, but you are saying you are only going to initialize it once in the creation call.
D3D11_USAGE_DYNAMIC means put it in shared system memory, the CPU & GPU both need access to it. There's usually a performance hit for the GPU to read from this compared to DEFAULT, so you want to use it sparingly. It's really for stuff you generate every frame on the CPU and then need to render (such as terrain systems or procedural geometry).
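As a rough sketch of the DYNAMIC pattern just described (the function name and parameters are illustrative; the buffer is assumed to have been created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE):

```cpp
#include <d3d11.h>
#include <cstring>

// Hypothetical helper: overwrite a DYNAMIC buffer from the CPU each frame.
HRESULT UpdateDynamicBuffer(ID3D11DeviceContext* context,
                            ID3D11Buffer* dynamicBuffer,
                            const void* data, size_t bytes)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    // WRITE_DISCARD hands back a fresh memory region, so the GPU can keep
    // reading the previous contents without stalling.
    HRESULT hr = context->Map(dynamicBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    if (FAILED(hr))
        return hr;
    std::memcpy(mapped.pData, data, bytes);
    context->Unmap(dynamicBuffer, 0);
    return S_OK;
}
```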
Games that use content streaming typically have a set of STAGING textures they load data into in the background, and then copy it to DEFAULT textures as they become available for efficient rendering. This lets them reuse both STAGING and DEFAULT textures without the overhead of destroying/creating resources every frame.
See this somewhat dated article on Microsoft Docs: Resource Management Best Practices
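For the 'readback' case mentioned above, a minimal sketch might look like this (a device, immediate context, and a DEFAULT-pool source texture are assumed to already exist; error handling is abbreviated):

```cpp
#include <d3d11.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical readback: copy a DEFAULT texture into a STAGING texture and map it on the CPU.
HRESULT ReadbackTexture(ID3D11Device* device,
                        ID3D11DeviceContext* context,
                        ID3D11Texture2D* source)
{
    // Describe a STAGING copy of the source: CPU-readable, no bind flags.
    D3D11_TEXTURE2D_DESC desc = {};
    source->GetDesc(&desc);
    desc.Usage = D3D11_USAGE_STAGING;
    desc.BindFlags = 0;
    desc.MiscFlags = 0;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;

    ComPtr<ID3D11Texture2D> staging;
    HRESULT hr = device->CreateTexture2D(&desc, nullptr, &staging);
    if (FAILED(hr))
        return hr;

    // GPU-side copy from DEFAULT (VRAM) to STAGING (CPU-accessible system memory).
    context->CopyResource(staging.Get(), source);

    // Mapping the STAGING resource stalls until the copy has completed.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    hr = context->Map(staging.Get(), 0, D3D11_MAP_READ, 0, &mapped);
    if (FAILED(hr))
        return hr;

    // mapped.pData / mapped.RowPitch now describe the pixel data on the CPU.
    context->Unmap(staging.Get(), 0);
    return S_OK;
}
```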

Allocate 5 GB of RAM in a more compact way

I just ported some code from C/C++ to Go, it is a microservice. It works well, even faster than in C/C++. But I have a problem with memory.
When my program starts, it allocates about 4.5 GB of RAM and fills it with data from disk, processing the data while loading; it will then run for days (hopefully months) serving requests from RAM. Unfortunately, after the processing and placement of data in RAM is finished, an extra 3.5 GB of RAM remains allocated by Go. I do not do any deallocations, only allocations, and I do not think my program really uses 8 GB at any point, so I think Go just acquires extra RAM because it "feels" I might need more soon, but I will not.
I read that Go does not provide any way to deallocate unused RAM and return it to the system. I want to run more services on the same machine, saving as much RAM as possible, so wasting almost as much as I actually use feels wrong.
So how do I keep the memory footprint compact and avoid those empty 3.5 GB being held by Go?
Speaking of virtual memory (see "Go Memory Management" by Povilas Versockas, and RSS vs. VSZ), avoid testing your program with Go 1.11.
See golang/go issue 28114: "runtime: programs compiled by 1.11 allocate an unreasonable amount of virtual memory".
Also discussed here.
Possibly related to:
CL 85888: runtime: remove non-reserved heap logic
issue 10460: runtime: 512GB memory limitation
Possible workaround: golang/go issue 28081
Most likely that is virtual memory that Go is using, not actual committed pages of RAM.
DeferPanic blog post: "Understanding Go Lang Memory Usage" (2014)

Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)

I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Even using it as a bounce buffer is untenably slow. IIUC, ARM caches are not DMA-coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We also use /dev/mem to map this region into userspace, and I've timed memcpy at around 70 MB/s.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address and wants a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I haven't been able to work out. One of the problems is that memcpy in userspace appears to perform really badly. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has NEON.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
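A rough sketch of that approach follows, under the assumption that the reserved region's base address and size are known at build time. All names here (the device name, RESERVED_PHYS, RESERVED_SIZE) are placeholders, not a tested driver for this board:

```c
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/miscdevice.h>

#define RESERVED_PHYS  0x38000000UL     /* placeholder: start of the reserved 64 MB */
#define RESERVED_SIZE  (64UL << 20)

static int reserved_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > RESERVED_SIZE)
        return -EINVAL;

    /*
     * Unlike the ioremap()/dev/mem path, vma->vm_page_prot is left alone here
     * (no pgprot_noncached()), so the user mapping stays cacheable. The driver
     * then has to flush/invalidate the affected cache lines around each DMA.
     */
    return remap_pfn_range(vma, vma->vm_start,
                           RESERVED_PHYS >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations reserved_fops = {
    .owner = THIS_MODULE,
    .mmap  = reserved_mmap,
};

static struct miscdevice reserved_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "reserved_dma",
    .fops  = &reserved_fops,
};

static int __init reserved_init(void)
{
    return misc_register(&reserved_dev);
}

static void __exit reserved_exit(void)
{
    misc_deregister(&reserved_dev);
}

module_init(reserved_init);
module_exit(reserved_exit);
MODULE_LICENSE("GPL");
```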
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has NEON instructions, and an assembly memcpy implementation using NEON instructions, with buffers aligned on cache-line boundaries (32 bytes), will achieve rates of 300 MB/s or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step - though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.

In a GC environment, when will Core Data release its allocated memory?

In my application, it currently seems that Core Data is busy allocating space in memory for different objects; however, it never releases that memory. The memory used by the application keeps growing the longer it runs.
Is there a call to the Core Data context (or something else) that ensures all memory is cleaned up? When will Core Data release the allocated memory?
Thanks!
Even when Core Data has finished with an object (which might not be when you think), the garbage collector won't necessarily collect it straight away.
The garbage collector has two methods to trigger collection: collectIfNeeded and collectExhaustively. The former doesn't guarantee to collect right now, and the latter will probably stall your application for a bit.
You can force Core Data to turn its managed objects back into faults. See Reducing Memory Overhead for details.

CUDA Memory Allocation accessible for both host and device

I'm trying to figure out a way to allocate a block of memory that is accessible by both the host (CPU) and the device (GPU). Other than using the cudaHostAlloc() function to allocate page-locked memory that is accessible to both the CPU and GPU, are there any other ways of allocating such blocks of memory? Thanks in advance for your comments.
The only way for the host and the device to "share" memory is using the newer zero-copy functionality. This is available on the GT200 architecture cards and some newer laptop cards. This memory must be, as you note, allocated with cudaHostAlloc so that it is page locked. There is no alternative, and even this functionality is not available on older CUDA capable cards.
If you're just looking for an easy (possibly non-performant) way to manage host to device transfers, check out the Thrust library. It has a vector class that lets you allocate memory on the device, but read and write to it from host code as if it were on the host.
Another alternative is to write your own wrapper that manages the transfers for you.
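As a rough host-side illustration of the mapped pinned-memory ("zero-copy") path described above (sizes and names are arbitrary, and error checking is omitted for brevity):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Must be set before the CUDA context is created so that mapped pinned
    // allocations are visible to the device.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host allocation that is also mapped into the device address space.
    float* hostPtr = nullptr;
    cudaHostAlloc((void**)&hostPtr, 1024 * sizeof(float), cudaHostAllocMapped);

    // Device-side alias of the same physical memory; kernels access it over the
    // bus instead of through an explicit cudaMemcpy.
    float* devPtr = nullptr;
    cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);

    hostPtr[0] = 42.0f;   // written by the CPU...
    // ...a kernel launched with devPtr would read/write the same buffer.

    std::printf("%f\n", hostPtr[0]);
    cudaFreeHost(hostPtr);
    return 0;
}
```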
There is no way to allocate a buffer that is accessible by both the GPU and the CPU unless you use cudaHostAlloc(). This is because not only must you allocate the pinned memory on the CPU (which you could do outside of CUDA), but also you must map the memory into the GPU's (or more specifically, the context's) virtual memory.
It's true that on a discrete GPU zero-copy does incur a bus transfer. However if your access is nicely coalesced and you only consume the data once it can still be efficient, since the alternative is to transfer the data to the device and then read it into the multiprocessors in two stages.
No, there is no "automatic" way of uploading buffers to GPU memory.
