When I use the MOT feature of the openGauss database, how can I manually release the numa_node memory?

When I use the MOT feature of the openGauss database, the numa_node memory fills up and is not released for a long time. How can I manually release it? Is there any quick method?

The question is interesting, and though there is a short answer, it requires an explanation.
Short answer:
Manually closing/killing sessions may free up MOT local memory, in case there were large transactions or large inserts or updates (though usually sessions hold quite little memory). The majority of MOT memory, which is the MOT global memory, will not be affected.
A Vacuum command can help: VACUUM FULL [MOT_table1];
This is useful only when MOT table sizes are significantly reduced (possibly periodically) and are not expected to grow to their original size in the near future.
A server restart releases all MOT memory back to the operating system.
Explanation:
openGauss MOT has highly optimized memory management; it is worth reading the documentation about its NUMA-aware allocation and affinity and about MOT memory planning.
Firstly, to facilitate quick operation and make efficient use of NUMA nodes, MOT allocates a designated memory pool for rows per table and for nodes per index. Each such pool is composed of 2 MB chunks. A designated API allocates these chunks from a local NUMA node, from pages coming from all nodes, or in a round-robin fashion where each chunk is allocated on the next node. By default, pools of shared data are allocated in a round-robin fashion in order to balance access while not splitting rows between different NUMA nodes. Thread-private memory, however, is allocated from the local node. It is also verified that a thread always operates on the same NUMA node.
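As an illustration of that round-robin scheme, here is a minimal C sketch using libnuma (my own illustration, not the actual openGauss implementation; the alloc_chunk_round_robin helper is hypothetical):

    #include <numa.h>     /* libnuma; link with -lnuma */
    #include <stdio.h>

    #define CHUNK_SIZE (2UL * 1024 * 1024)   /* 2 MB, as in MOT's pools */

    /* Each new chunk is placed on the next NUMA node in turn, so a
       shared pool is spread evenly across nodes while any single
       2 MB chunk (and the rows inside it) stays on one node. */
    static int next_node = 0;

    static void *alloc_chunk_round_robin(void)
    {
        int nodes = numa_max_node() + 1;
        void *chunk = numa_alloc_onnode(CHUNK_SIZE, next_node);
        next_node = (next_node + 1) % nodes;
        return chunk;
    }

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        void *c1 = alloc_chunk_round_robin();  /* lands on node 0 */
        void *c2 = alloc_chunk_round_robin();  /* node 1, if present */
        numa_free(c1, CHUNK_SIZE);
        numa_free(c2, CHUNK_SIZE);
        return 0;
    }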
Secondly, the MOT design expects data growth. Once a memory chunk has been added to a memory pool and used (by inserting data), deleting rows only marks memory sections internally as free and ready for reuse; the chunks are not released back to the operating system.
A manually activated VACUUM command can optimize the distribution of rows within and across memory chunks, compacting them into densely populated chunks and freeing the remaining chunks back to the OS.

Related

Is MAP_HUGETLB a synonym for physically contiguous memory? (when successful)

Am I correct to assume that memory mmap'd using MAP_HUGETLB|MAP_ANONYMOUS is actually 100% physically contiguous? At least at the huge page granularity, 2 MB or 1 GB.
Otherwise I don't know how it could work/be performant since the TLB would need more entries...
Yes, it is. Indeed, as you point out, if it weren't, multiple page table entries would be needed for a single huge page, which would defeat the entire purpose of having a huge page.
Here's an excerpt from Documentation/admin-guide/mm/hugetlbpage.rst:
The default for the allowed nodes--when the task has default memory
policy--is all on-line nodes with memory. Allowed nodes with
insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the
discussion below <mem_policy_and_hp_alloc> of the interaction
of task memory policy, cpusets and per node attributes with the
allocation and freeing of persistent huge pages.
The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in system at the time
of the allocation attempt. If the kernel is unable to allocate huge
pages from some nodes in a NUMA system, it will attempt to make up the
difference by allocating extra pages on other nodes with sufficient
available contiguous memory, if any.
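To see this in practice, here is a minimal C sketch (my own example, assuming a Linux system where huge pages have been reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages) that maps one anonymous 2 MB huge page:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LEN (2UL * 1024 * 1024)   /* one 2 MB huge page */

    int main(void)
    {
        /* Request an anonymous mapping backed by huge pages. This
           fails with ENOMEM if no free huge page is available. */
        void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        /* On success, the whole 2 MB region is one physically
           contiguous huge page covered by a single TLB entry. */
        memset(p, 0, LEN);
        munmap(p, LEN);
        return 0;
    }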
See also: How do I allocate a DMA buffer backed by 1GB HugePages in a linux kernel module?

Variable allocation and tracking

I started searching and reading about algorithms, data structures, and memory management recently, after a question about memory allocation occurred to me. After a couple of days of study I have learnt a lot about memory management, but the actual question remains unanswered.
So the question is: while allocating memory to a variable, how exactly does the system know which block of memory is in use and which is free? Similarly, when we destroy an object, set a variable to null, or the GC frees up some memory, what exactly happens to that block of memory? As far as I know, the actual data is never erased on deletion; the block just gets marked as free somewhere in some table. But does that table keep track of each and every bit of memory? If so, wouldn't that become a lot of data in itself to store?
For example, if I declare a linked list, a block will be allocated on the heap with its next pointer set to null, as there is no other node to reference. As I keep adding nodes, the system will keep allocating more blocks, each containing a reference to the next one. These blocks can be at random locations depending on the availability of memory at allocation time, and can only be reached through their preceding nodes.
So, for any given block of memory, how does the system know whether it is free and just holds garbage, or whether it is actually a node of some linked list?
On a modern operating system the process has a logical, linear address space. Part of that address space is reserved for the system and is common to all processes. Some of the address space may be reserved but most of the remainder is available to the process.
The address space is defined by PAGE TABLES. The structure of the page table is defined by the processor, but the operating system maintains a table for each process. Memory is allocated to a process in PAGES. The smallest page size I am aware of is 512 bytes, but it can go up to a megabyte or even larger on some processors and in some processor configurations. The size is always a power of 2.
The page table defines:
Whether a page has actually been mapped to the process
Whether the page has a corresponding physical memory location
If so, the mapping to that physical location.
The operating system only knows about pages.
At the next level down there are memory managers. These are not part of the operating system. Memory managers manage heaps that consist of pages allocated by the operating system. The memory manager has to keep track of the heap size and what memory has been allocated within it.
Memory managers operate in a huge number of different ways. There are malloc/free implementations galore that you can link into your code to get different behaviors.
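To make the bookkeeping concrete, here is a toy first-fit allocator in C (my own illustrative sketch, not any real malloc implementation; it assumes a 64-bit platform so the block header is 16 bytes). Note how "freeing" only links the block onto a free list via its header and never erases the payload bytes:

    #include <stddef.h>
    #include <stdio.h>

    /* Each block starts with a header holding its size and a free-list
       link; this is how the manager "knows" which blocks are free.
       The user data itself is never examined or erased. */
    typedef struct block {
        size_t size;          /* payload size in bytes */
        struct block *next;   /* next free block, NULL while in use */
    } block_t;

    static _Alignas(16) unsigned char arena[64 * 1024];
    static block_t *free_list = NULL;
    static size_t arena_used = 0;

    static void *toy_alloc(size_t size)
    {
        size = (size + 15) & ~(size_t)15;    /* keep blocks aligned */
        /* First-fit search of the free list. */
        block_t **prev = &free_list;
        for (block_t *b = free_list; b; prev = &b->next, b = b->next) {
            if (b->size >= size) {
                *prev = b->next;             /* unlink from free list */
                b->next = NULL;
                return b + 1;                /* payload follows header */
            }
        }
        /* No fit: carve a fresh block from the arena. */
        if (arena_used + sizeof(block_t) + size > sizeof(arena))
            return NULL;
        block_t *b = (block_t *)(arena + arena_used);
        arena_used += sizeof(block_t) + size;
        b->size = size;
        b->next = NULL;
        return b + 1;
    }

    static void toy_free(void *p)
    {
        /* "Freeing" just pushes the block onto the free list. */
        block_t *b = (block_t *)p - 1;
        b->next = free_list;
        free_list = b;
    }

    int main(void)
    {
        int *a = toy_alloc(sizeof(int));
        *a = 42;
        toy_free(a);                      /* 42 is still in memory */
        int *c = toy_alloc(sizeof(int));  /* reuses the same block */
        printf("%p %p\n", (void *)a, (void *)c);
        return 0;
    }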

Is coalesced memory access a feature or phenomenon?

I'm current writing a smaller project in OpenCL, and I'm trying to find out what really causes memory coalescing. Every book on GPGPU programming says it's how GPGPUs should be programmed, but not why the hardware would prefer this.
So is it some special hardware component which merges data transfers? Or is it simply to better utilize the cache? Or is it something completely different?
Memory coalescing makes several different things more efficient. It is usually done before the requests hit the cache. Similar to the SIMT execution model, it is an architectural trade-off. It enables GPUs to have a more efficient and very high performance memory system, but also forces programmers to think carefully about their data layout.
Without coalescing, either the cache needs to be able to serve a huge number of requests at the same time, or memory access would take a lot longer as the different data transfers would be handled one at a time. This is relevant even when just checking whether something is a hit or a miss.
Merging requests is rather easy to do: you just pick one transfer and then merge all requests with matching upper address bits. You generate a single request per cycle and replay the load or store instruction until all threads have been handled.
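A toy model of this merging logic (my own illustration in C, with an assumed 128-byte transaction size and 32-thread warps) shows how grouping addresses by their upper bits turns 32 consecutive loads into a single transaction, while strided loads stay at 32:

    #include <stdint.h>
    #include <stdio.h>

    #define WARP_SIZE  32
    #define LINE_BYTES 128   /* assumed transaction / cache-line size */

    /* Count the transactions a warp's loads need after coalescing:
       group the 32 addresses by their upper bits (addr / LINE_BYTES). */
    static int count_transactions(const uint64_t addr[WARP_SIZE])
    {
        uint64_t seen[WARP_SIZE];
        int n = 0;
        for (int t = 0; t < WARP_SIZE; t++) {
            uint64_t line = addr[t] / LINE_BYTES;
            int found = 0;
            for (int i = 0; i < n; i++)
                if (seen[i] == line) { found = 1; break; }
            if (!found)
                seen[n++] = line;
        }
        return n;
    }

    int main(void)
    {
        uint64_t coalesced[WARP_SIZE], strided[WARP_SIZE];
        for (uint64_t t = 0; t < WARP_SIZE; t++) {
            coalesced[t] = 0x1000 + 4 * t;    /* consecutive 4-byte loads */
            strided[t]   = 0x1000 + 256 * t;  /* 256-byte stride */
        }
        printf("coalesced: %d transaction(s)\n", count_transactions(coalesced)); /* 1  */
        printf("strided:   %d transaction(s)\n", count_transactions(strided));   /* 32 */
        return 0;
    }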
Caches also store consecutive bytes (32/64/128 bytes per line). This fits most applications well, is a good fit for modern DRAM, and reduces the overhead of cache bookkeeping information: the cache is organized in cache lines, and each cache line has a tag that indicates which addresses are stored in the line.
Modern DRAM uses wide interfaces and long bursts: the memory of a GPU is typically organized in 32-bit or 64-bit wide channels with GDDR5 memory that has a burst length of 8. This means that every transaction at the DRAM interface has to fetch at least 32 bit × 8 = 32 bytes or 64 bit × 8 = 64 bytes at a time, even if just a single byte is needed. Designing data layouts that lead to coalesced requests helps to use the DRAM interface efficiently.
GPUs also have a huge number of parallel threads active at once and, at the same time, rather small caches. CPUs are often able to use their caches to reorder memory requests into DRAM-friendly patterns. The larger number of threads and smaller caches on GPUs make this "cache based coalescing" less efficient, as the data will often not stay in the cache long enough to be merged with other requests to the same cache line.
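The usual consequence for data layout is to prefer structure-of-arrays (SoA) over array-of-structures (AoS). A small illustrative sketch in C (the particle types are hypothetical):

    #include <stddef.h>

    struct particle_aos {        /* AoS: thread t reads aos[t].x      */
        float x, y, z;           /* stride between x values: 32 bytes */
        float vx, vy, vz;
        float mass, charge;
    };

    struct particles_soa {       /* SoA: thread t reads soa.x[t]      */
        float *x, *y, *z;        /* stride between x values: 4 bytes  */
        float *vx, *vy, *vz;
        float *mass, *charge;
    };

    /* Reading only the x fields: with AoS, neighbouring threads touch
       addresses 32 bytes apart, spreading one 32-thread warp's loads
       over eight 128-byte lines; with SoA the same reads are
       consecutive and coalesce into a single transaction. */
    float sum_x_aos(const struct particle_aos *p, size_t n) {
        float s = 0;
        for (size_t i = 0; i < n; i++) s += p[i].x;
        return s;
    }
    float sum_x_soa(const struct particles_soa *p, size_t n) {
        float s = 0;
        for (size_t i = 0; i < n; i++) s += p->x[i];
        return s;
    }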
Despite the "random access" in the name RAM (random-access memory), DDR3 RAM is faster at accessing consecutive locations than random ones.
Case in point: "CAS Latency" is the amount of time that DDR3 RAM will stall when you're accessing a new "column", as your RAM chip is literally charging up to serve the new data from another location on the chip.
EDIT: Jan Lucas argues that RAS Latency is more important in practice. See his comment for details.
There's roughly a 10 ns delay whenever you switch columns. So if you have a bunch of memory accesses and you keep accessing data that is 'close' together, you don't invoke a CAS delay.
So if you have 20 words to access at a particular location, it's more efficient to access those 20 words before moving to a new memory location (which invokes a CAS delay). Otherwise, you'll have to invoke another CAS delay to "switch back" between memory locations.
It's only around 10 nanoseconds, but that amount of time adds up quickly.

What is Peak Working Set in windows task manager

I'm confused about the Windows Task Manager memory overview.
In the general memory overview it shows 7.9 GB "in use" (in my sample).
I've used Process Explorer to sum up the used memory, and it shows me the following:
Since this is the nearest number to the 7.9 GB in Task Manager, I guess this is the value shown there.
Now my question:
What is the Peak working set?
If I hover over the column in Task Manager, it shows a tooltip,
and the Microsoft help says: "Maximum amount of working set memory used by the process."
Is it the memory currently used by all processes, or the maximum amount of memory that was ever used by all processes?
The number you refer to is "Memory used by processes, drivers and the operating system" [source].
This is an easy but somewhat vague description. A roughly equivalent description would be: the total amount of memory that is neither free, nor part of the buffer cache, nor part of the standby list.
It is not the maximum memory used at some time ("peak"), it's a coincidence that you have roughly the same number there. It is the presently used amount (used by "everyone", that is all programs and the OS).
The peak working set is a different thing. The working set is the amount of memory in a process (or, if you consider several processes, in all these processes) that is currently in physical memory. The peak working set is, consequently, the maximum value so far seen.
A process may allocate more memory than it actually ever commits ("uses"), and most processes will commit more memory than they have in their working set at one time. This is perfectly normal. Pages are moved in and out of working sets (and into the standby list) to assure that the computer, which has only a finite amount of memory, always has enough reserves to satisfy any memory needs.
The memory figures in question aren't actually a reliable indicator of how much memory a process is using.
A brief explanation of each of the memory relationships:
Private Bytes are what the process has allocated, including pagefile usage.
Working Set is the non-paged Private Bytes plus memory-mapped files.
Virtual Bytes are the Working Set plus paged Private Bytes and the standby list.
In answer to your question the peak working set is the maximum amount of physical RAM that was assigned to the process in question.
~ Update ~
Available memory is defined as the sum of the standby list plus free memory. There is far more to total memory usage than the sum of all process working sets. Because of this, and due to memory sharing, this value is not generally very useful.
The virtual size of a process is the portion of the process's virtual address space that has been allocated for use. There is no relationship between this and physical memory usage.
Private bytes is the portion of a process's virtual address space that has been allocated for private use. It does not include shared memory or memory used for code. There is no relationship between this value and physical memory usage either.
Working set is the amount of physical memory in use by a process. Due to memory sharing there will be some double counting in this value.
The terms mentioned above aren't really going to mean very much until you understand the basic concepts in Windows memory management. Have a look HERE for some further reading.
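If you want to look at these counters programmatically rather than in Task Manager, a process can query its own working set and peak working set through the Win32 PSAPI; a minimal sketch in C (link with psapi.lib):

    #include <windows.h>
    #include <psapi.h>
    #include <stdio.h>

    int main(void)
    {
        PROCESS_MEMORY_COUNTERS pmc;
        /* GetCurrentProcess() returns a pseudo-handle to our own process. */
        if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
            printf("WorkingSetSize:     %zu KB\n",
                   (size_t)(pmc.WorkingSetSize / 1024));
            printf("PeakWorkingSetSize: %zu KB\n",
                   (size_t)(pmc.PeakWorkingSetSize / 1024));
        }
        return 0;
    }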

Benefits of reserving vs. committing+reserving memory using VirtualAlloc on large arrays

I am writing a C++ program that essentially works with very large arrays. On Windows, I am using VirtualAlloc to allocate memory for my arrays. Now I fully understand the difference between reserving and committing memory using VirtualAlloc; however, I am wondering whether there is any benefit in committing memory page-by-page to a reserved region. In particular, MSDN (http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx) contains the following explanation for the MEM_COMMIT option:
Actual physical pages are not allocated unless/until the virtual addresses are actually accessed.
My experiments confirm this: I can reserve and commit several GB of memory without increasing the memory usage of my process (as shown in Task Manager); actual memory gets allocated only when I actually access it.
Now, I have seen quite a few examples arguing that one should reserve a large portion of the address space and then commit memory page-by-page (or in some larger blocks, depending on the app's logic). As explained above, however, memory does not seem to be committed before one accesses it; thus, I'm wondering whether there is any real benefit in committing memory page-by-page. In fact, committing page-by-page might actually slow my program down, due to the many system calls for committing memory. If I commit the entire region at once, I pay for just one system call, and the kernel seems to be smart enough to actually allocate only the memory that I actually use.
I would appreciate it if someone could explain to me which strategy is better.
The difference is that commit "backs" the memory against the page file. To give an example:
Given 2GB of physical ram and 2GB of swap (assume fixed-size swap for this purpose).
Reserve 6GB - OK.
Commit first 2GB - OK.
Commit remaining 4GB - fails.
Extend swap file to 8GB
Commit remaining 4GB - succeeds.
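A minimal C sketch of the reserve-then-commit pattern under discussion (the sizes and the on-demand loop are just for illustration):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const SIZE_T RESERVE_SIZE = 64 * 1024 * 1024;  /* 64 MB of address space */
        const SIZE_T PAGE = 4096;

        /* Reserve address space only: no pagefile charge yet. */
        char *base = VirtualAlloc(NULL, RESERVE_SIZE, MEM_RESERVE, PAGE_NOACCESS);
        if (!base) return 1;

        /* Commit pages on demand as the array grows. Each commit is
           charged against RAM + pagefile, so it can fail here, with a
           NULL return we can handle, instead of faulting later on
           first access. */
        for (SIZE_T used = 0; used < 10 * PAGE; used += PAGE) {
            if (!VirtualAlloc(base + used, PAGE, MEM_COMMIT, PAGE_READWRITE)) {
                fprintf(stderr, "commit failed at offset %zu\n", (size_t)used);
                return 1;
            }
            base[used] = 1;   /* physical page assigned on first touch */
        }

        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }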
The reason for using MEM_COMMIT would primarily be runtime error suppression (app stability). If you have a process that commits pages on demand, there's always a chance that a commit along the way could fail if it exceeds the amount of memory + swap available. Once memory has been backed by the page file, you have a strong guarantee that it is available for use from now until the point at which you release it.
There's a number of reasons to go one way or the other, and I don't think there's any perfect science to deciding which. MEM_RESERVE alone is only needed for very large sparse array scenarios, ex: multi-gigabyte array which has at most 25-33% utilization (a popular technique for accelerating hash tables, etc).
Almost everything else is gray area where you could probably go either way -- MEM_COMMIT up-front would make your own app a little more stable and essentially give it priority to physical ram over competing apps that might allocate on-demand. (if you grab the ram first then your app will be the last left standing when physical memory is exhausted) At the same time, if you're not actually using all that ram then you may end up limiting the multi-tasking potential of your client's machine or causing unnecessary wasted disk space via a growing page file.
