There is a LimitAS option which can limit virtual memory (equivalent to ulimit -v), but there is also MemoryLimit (obsoleted by MemoryMax in newer versions). What's the difference between them? Do they serve the same purpose?
LimitAS and the other limits in systemd.exec(5) correspond to ulimit, i.e. the setrlimit system call, and are per-process – a process can evade them by forking child processes (the children each inherit the limit, but their memory usage is counted separately). MemoryLimit and the other limits in systemd.resource-control(5) correspond to cgroup limits and apply to all processes in the control group collectively, which a process cannot escape. You almost certainly want to use those.
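To see the difference in practice, here is a minimal sketch of a drop-in for a hypothetical my-workload.service (the unit name and file path are assumptions; on older systemd versions use MemoryLimit= instead of MemoryMax=):
# /etc/systemd/system/my-workload.service.d/limits.conf   (hypothetical drop-in)
[Service]
LimitAS=512M       # per-process rlimit (setrlimit); each forked child gets its own 512M budget
MemoryMax=512M     # cgroup limit; counts all processes of the unit together
After adding the drop-in, run systemctl daemon-reload and restart the service.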
I noticed that numactl has some strange impact on the STREAM benchmark.
More specifically,
"numactl ./stream_c.exe" reports 40% lower memory bandwidth than "./stream_c.exe".
I checked the numactl source code and don't see anything special it should do if I don't give it any parameters. So I would naively expect numactl to have no performance impact in "numactl ./stream_c.exe", but that is not true according to my experiment.
This is a two-socket server with high-core-count processors.
Using numastat, I can see that the numactl command causes the memory allocation to be unbalanced: the two NUMA nodes split the memory allocation 80:20.
Without numactl, the memory is allocated in a much more balanced way: 46:54.
I also found that this is not only a numactl issue. If I use perf to invoke stream_c.exe, the memory allocation is even more unbalanced than with numactl.
So this is more like a kernel question: how do numactl and perf change the memory placement policy for the sub-processes?
Thanks!
TL;DR: The default policy used by numactl can cause performance issues, as can the OpenMP thread binding. numactl constraints are applied to all (forked) child processes.
Indeed, numactl uses a predefined policy by default. This policy can be --interleave, --preferred, --membind, or --localalloc. The policy changes the behavior of the operating system's page allocation when a page is first touched. Here is the meaning of each policy:
--interleave: memory pages are allocated across the nodes specified by a nodeset, in a round-robin fashion;
--preferred: memory is allocated from a single preferred memory node. If sufficient memory is not available, memory can be allocated from other nodes;
--membind: only allocate memory from the specified nodes. Allocation will fail when there is not enough memory available on these nodes;
--localalloc: always allocate on the current node (the one performing the first touch of the memory page).
In your case, specifying an --interleave or a --localalloc policy should give better performance. I guess the --localalloc policy should be the best choice if threads are bound to cores (see below).
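For example (a sketch using the binary name from your question; --interleave=all uses all nodes, so adjust the nodeset as needed):
numactl --interleave=all ./stream_c.exe   # spread pages round-robin across all NUMA nodes
numactl --localalloc ./stream_c.exe       # allocate each page on the node that first touches it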
Moreover, the STREAM_ARRAY_SIZE macro is set to a value that is too small by default to actually measure the performance of the RAM. Indeed, the AMD EPYC 7742 processor has a 256 MiB L3 cache, which is big enough to fit all the data of the benchmark. It is much better to compare results with a working set bigger than the L3 cache (e.g. 1024 MiB).
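For instance, a possible rebuild with a larger working set (the value of 134217728 elements, about 1 GiB per array, is just an illustration):
gcc -O2 -march=native -fopenmp -DSTREAM_ARRAY_SIZE=134217728 stream.c -o stream_c.exe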
Finally, OpenMP threads may migrate from one NUMA node to another. This can drastically decrease the performance of the benchmark: when a thread moves to another node, the memory pages it accesses are located on a remote node, and the target NUMA node can become saturated. You need to bind OpenMP threads so they cannot move, using for example the following environment variables for this specific processor: OMP_NUM_THREADS=64 OMP_PROC_BIND=TRUE OMP_PLACES="{0}:64:1", assuming SMT is disabled and the core IDs are in the range 0-63. If SMT is enabled, you should tune the OMP_PLACES variable using the command OMP_PLACES={$(hwloc-calc --sep "},{" --intersect PU core:all.pu:0)} (which requires the hwloc package to be installed on the machine). You can check the thread binding with the command hwloc-ps.
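Putting it together, a possible run could look like the following sketch (it assumes SMT is disabled, core IDs 0-63, and the --localalloc policy discussed above):
export OMP_NUM_THREADS=64
export OMP_PROC_BIND=TRUE
export OMP_PLACES="{0}:64:1"
numactl --localalloc ./stream_c.exe
# In another shell, while the benchmark runs: hwloc-ps    (verifies the thread binding)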
Update:
numactl impacts the NUMA attributes of all the child processes (since child processes are created using fork on Linux, which copies the NUMA attributes). You can check that with the following (bash) commands:
numactl --show
numactl --physcpubind=2 bash -c "(sleep 1; numactl --show; sleep 2;) & (sleep 2; numactl --show; sleep 1;); wait"
The first command shows the initial NUMA attributes. The second sets them for a child bash process, which launches 2 processes in parallel; each sub-child process then shows its NUMA attributes. The result is the following on my 1-node machine:
policy: default # Initial NUMA attributes
preferred node: current
physcpubind: 0 1 2 3 4 5
cpubind: 0
nodebind: 0
membind: 0
policy: default # Sub-child 1
preferred node: current
physcpubind: 2
cpubind: 0
nodebind: 0
membind: 0
policy: default # Sub-child 2
preferred node: current
physcpubind: 2
cpubind: 0
nodebind: 0
membind: 0
Here, we can see that the --physcpubind=2 constraint is applied to the two sub-child processes. It should be the same with --membind or with the NUMA policy on your multi-node machine. Note, therefore, that invoking perf ahead of the sub-child processes should not have any impact on the NUMA attributes.
perf should not directly impact the allocations. However, the default policy of the OS could balance the amount of allocated RAM on each NUMA node. If this is the case, the page allocations can be unbalanced, decreasing the bandwidth due to the saturation of the NUMA nodes. The OS memory page balancing between NUMA nodes is very sensitive: writing a huge file between two benchmark runs can affect the second run, for example (and thus affect the performance of the benchmark). This is why NUMA attributes (as well as process binding) should always be set manually for HPC benchmarks.
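As a quick sanity check, you can watch the per-node placement while the benchmark runs (a sketch; numastat ships with the numactl package):
./stream_c.exe &        # or: numactl --interleave=all ./stream_c.exe &
numastat -p $!          # per-node memory breakdown of that process
wait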
PS: I assumed the STREAM benchmark has been compiled with OpenMP support and with optimizations enabled (i.e. -fopenmp -O2 -march=native on GCC and Clang).
We all know that runtime.GOMAXPROCS is set to the number of CPU cores by default. What if this value is set too large?
Will program have more context switches?
Will garbage collector be triggered more frequently?
GOMAXPROCS is set to the number of available logical CPUs by default for a reason: this gives the best performance in most cases.
GOMAXPROCS only limits the number of "active" threads: if a thread's goroutine gets blocked (e.g. by a syscall), a new thread might be started. There is no direct correlation; see Number of threads used by Go runtime.
If GOMAXPROCS is greater than the number of available CPUs, then there can be more active threads than CPU cores, which means active threads have to be "multiplexed" onto the available processing units. So yes, there will be more context switches if there are more active threads than cores, which is not necessarily the case.
Garbage collections are not directly related to the number of threads, so you shouldn't worry about that. Quoting from package runtime:
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
If you have more threads that don't allocate / release memory, that shouldn't affect how frequently collections are triggered.
There might be cases when setting GOMAXPROCS above the number of CPUs increases the performance of your app, but they are rare. Measure to find out if it helps in your case.
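A minimal way to measure is to run the same binary with different settings via the environment (the binary name ./myapp is a placeholder, and perf assumes a Linux machine):
GOMAXPROCS=$(nproc) perf stat -e context-switches ./myapp   # baseline: one P per logical CPU
GOMAXPROCS=256 perf stat -e context-switches ./myapp        # oversubscribed: compare the switch count
GOGC=200 ./myapp                                            # GC frequency is tuned independently via GOGC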
Is the heap local to a process? In other words, we have the stack, which is always local to a process and is separate for each process. Does the same apply to the heap? Also, if the heap is local, I believe the heap size should change during runtime as we request more and more memory, so who puts an upper limit on how much memory can be requested?
Heaps are indeed local to the process. Limits are placed by the operating system. Memory can also be limited by the number of bits used for addressing (i.e. 32 bits can only address 4G of memory at a time).
Yes, on modern operating systems there exists a separate heap for each process. By the way, there is not just a separate stack for every process, but a separate stack for every thread in the process. Thus a process can have quite a number of independent stacks.
But not all operating systems and not all hardware platforms offer this feature. You need a memory management unit (in hardware) for that to work. Desktop computers have had that feature since... well... a while back... the 386 CPU? (Leave a comment if you know better.) You may, though, find yourself on some kind of microprocessor that does not have that feature.
Anyway: the heap size is mainly limited by the operating system and the hardware. The hardware limits it especially through the limited amount of address space it allows. For example, a 32-bit CPU will not address more than 4 GB (2^32 bytes). A CPU that features Physical Address Extension (PAE), which current CPUs do support, can address up to 64 GB, but that is done through extended page tables, and a single process will not be able to make use of it: it will always see 4 GB max.
Additionally the operating system can limit the memory as it sees fit. On Linux you can see and set limits using the ulimit command. If you are running some code not natively, but for example in an interpreter/virtual machine (such as Java, or PHP), then that environment can additionally limit the heap size.
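For example (illustrative only; exact flags and values depend on your shell, distribution and runtime, and MyApp is a placeholder):
ulimit -v               # show the current virtual memory (address space) limit, in KiB
ulimit -v 2097152       # cap this shell and its children to 2 GiB of address space
java -Xmx512m MyApp     # a VM such as the JVM can impose its own, smaller heap cap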
The heap is local to a process, but it is shared among that process's threads, while the stack is not: it is per-thread.
Regarding the limit: on Linux, for example, it is set by ulimit (see the man page).
On a modern, preemptively multi-tasking OS, each process gets its own address space. The set of memory pages that it can see are separate from the pages that other processes can see. As a result, yes, each process sees its own stack and heap, because the stack and heap are just areas of memory.
On an older, cooperatively multi-tasking OS, each process shared the same address space, so the heap was effectively shared among all processes.
The heap is defined by the collection of things in it, so the heap size only changes as memory is allocated and freed. This is true regardless of how the OS is managing memory.
The top limit of how much memory can be requested is determined by the memory manager. In a machine without virtual memory, the top limit is simply how much memory is installed in the computer. With virtual memory, the top limit is defined by physical memory plus the size of the swap file on the disk.
In a Windows operating system with 2 physical x86/amd64 processors (P0 + P1), running 2 processes (A + B), each with two threads (T0 + T1), is it possible (or even common) to see the following:
P0:A:T0 running at the same time as P1:B:T0
then, after 1 (or is that 2?) context switch(es?)
P0:B:T1 running at the same time as P1:A:T1
In a nutshell, I'd like to know if - on a multiple processor machine - the operating system is free to schedule any thread from any process at any time, regardless of what other threads from other processes are already running.
EDIT:
To clarify the silly example, imagine that process A's thread A:T0 has affinity to processor P0 (and A:T1 to P1), while process B's thread B:T0 has affinity to processor P1 (and B:T1 to P0). It probably doesn't matter whether these processors are cores or sockets.
Is there a first-class concept of a process context switch? Perfmon shows context switches under the Thread object, but nothing under the Process object.
Yes, it is possible, and it happens pretty often. The OS tries not to move a thread between CPUs (you can make it try harder by setting the thread's preferred processor, or you can even lock it to a single processor via affinity). A Windows process is not an execution unit by itself; from this viewpoint, it's basically just a context for its threads.
EDIT (further clarifications)
There's nothing like a "process context switch". Basically, the OS scheduler assigns threads, via a (very adaptive) round-robin algorithm, to any free processor/core the affinity allows if the "previous" processor isn't immediately available, regardless of which process they belong to (which means multi-threaded processes can grab much more CPU power).
This "jumping" may seem expensive, considering at least the L1 (and sometimes L2) caches are per-core (apart from different slot/package processors), but it's still cheaper than delays caused by waiting to the "right" processor and inability to do elaborate load-balancing (which the "jumping" scheme makes possible).This may not apply to the NUMA architecture, but there are much more considerations invoved (e.g. adapting all memory-allocations to be thread- and processor-bound and avoiding as much state/memory sharing as possible).
As for affinity: you can set affinity masks per thread or per process (the latter supersedes the settings of all the process's threads), but the OS enforces at least one logical processor per thread (you never end up with a zero mask).
A process' default affinity mask is inherited from its parent process (which allows you to create single-core loaders for problematic legacy executables), and threads inherit the mask from the process they belong to.
You may not set a threads affinity to a processor outside the process' affinity, but you can further limit it.
Any thread, by default, will jump between the available logical processors (especially if it yields, calls into the kernel, etc.); it may jump even if it has its preferred processor set, but only if it has to. However, it will NOT jump to a processor outside its affinity mask (which may lead to considerable delays).
I'm not sure if the scheduler sees any difference between physical and hyper-threaded processors, but even if it doesn't (which I assume), the consequences are in most cases not a concern, i.e. there should not be much difference between multiple threads sharing physical or logical processors if the thread count is the same. Regardless, there are some reports of cache thrashing in this scenario, mainly in high-performance, heavily multithreaded applications like SQL Server or the .NET and Java VMs, which may or may not benefit from turning HyperThreading off.
I generally agree with the previous answer; however, things are more complex.
Although processes are not execution units, threads belonging to the same process should be treated differently from threads of different processes. There are two reasons for this:
Same address space. This means that when switching context between such threads there is no need to set up the address translation registers.
Threads of the same process are much more likely to access the same memory.
Point (2) has a great impact on the cache state. If threads read the same memory location, they reuse the L2 cache, hence the whole thing speeds up. There is, however, a drawback too: once a thread changes a memory location, that address is invalidated in the L2 caches of both processors, so the other processor has to invalidate its cache as well.
So there are pros and cons to running the threads of the same process simultaneously (on different processors). By the way, this situation has a name: "gang scheduling".
It's highly likely that there is a limitation on how many synchronization objects (semaphores, events, critical sections) one process, and all processes on a given machine, can use. What exactly is this limitation?
For Windows, the per-process limit on kernel handles (semaphores, events, mutexes) is 2^24.
From MSDN:
Kernel object handles are process specific. That is, a process must either create the object or open an existing object to obtain a kernel object handle. The per-process limit on kernel handles is 2^24. However, handles are stored in the paged pool, so the actual number of handles you can create is based on available memory. The number of handles that you can create on 32-bit Windows is significantly lower than 2^24.
It depends on the quota that is available for the process. I think in XP it is set to 10000 per process, but it can grow. I am not sure what the upper limit is.
I just checked again: the 10000 limit is for GDI handles, not for kernel objects.