how do numactl & perf change memory placement policy of child processes?

I notice that numactl has a strange impact on the STREAM benchmark.
More specifically,
"numactl ./stream_c.exe" reports 40% lower memory bandwidth than "./stream_c.exe".
I checked the numactl source code and don't see anything special it should do when given no parameters. So I would naively expect numactl to have no performance impact in "numactl ./stream_c.exe", but my experiment shows otherwise.
This is a two-socket server with high-core-count processors.
Using numastat, I can see that the numactl command causes the memory allocation to become unbalanced: the two NUMA nodes split the memory allocation 80:20.
Without numactl, the memory is allocated in a much more balanced way: 46:54.
I also found that this is not only a numactl issue. If I use perf to invoke stream_c.exe, the memory allocation is even more unbalanced than with numactl.
So this is more like a kernel question: how do numactl and perf change the memory placement policy for their child processes?
Thanks!

TL;DR: The default policy used by numactl can cause performance issues, and so can the OpenMP thread binding. numactl constraints are applied to all (forked) child processes.
Indeed, numactl uses a predefined policy by default. This policy can be --interleave, --preferred, --membind or --localalloc. The policy changes the behavior of the operating-system page allocation when a page is first touched. Here is the meaning of each policy:
--interleave: memory pages are allocated across the nodes specified by a nodeset, in a round-robin fashion;
--preferred: memory is allocated from a single preferred memory node. If sufficient memory is not available, memory can be allocated from other nodes;
--membind: memory is only allocated from the specified nodes. Allocation fails when there is not enough memory available on these nodes;
--localalloc: memory is always allocated on the current node (the one performing the first touch of the memory page).
In your case, specifying an --interleave or a --localalloc policy should give better performance. I guess the --localalloc policy should be the best choice if threads are bound to cores (see below).
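For instance, a minimal sketch (reusing the ./stream_c.exe binary from the question) to compare the two policies explicitly:
numactl --interleave=all ./stream_c.exe
numactl --localalloc ./stream_c.exe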
Moreover, the STREAM_ARRAY_SIZE macro is set to a value that is too small by default (10,000,000 elements) to actually measure the performance of the RAM. Indeed, the AMD EPYC 7742 processor has a 256 MiB L3 cache, which is big enough to fit all the data of the benchmark. It is probably much better to compare results with a working set bigger than the L3 cache (e.g. 1 GiB per array).
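As an illustration, here is a hedged build sketch, assuming the reference stream.c source from the STREAM distribution; the array size is overridden at compile time so that each array is 1 GiB (128 Mi double elements), well above the 256 MiB L3:
gcc -O2 -march=native -fopenmp -DSTREAM_ARRAY_SIZE=134217728 stream.c -o stream_c.exe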
Finally, OpenMP threads may migrate from one NUMA node to another. This can drastically decrease the performance of the benchmark: when a thread moves to another node, the memory pages it accesses are located on a remote node, and the target NUMA node can be saturated. You need to bind the OpenMP threads so they cannot move, using for example the following environment variables for this specific processor: OMP_NUM_THREADS=64 OMP_PROC_BIND=TRUE OMP_PLACES="{0}:64:1", assuming SMT is disabled and the core IDs are in the range 0-63. If SMT is enabled, you should tune the OMP_PLACES variable using the command OMP_PLACES={$(hwloc-calc --sep "},{" --intersect PU core:all.pu:0)} (which requires the hwloc package to be installed on the machine). You can check the thread binding with the command hwloc-ps.
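Putting the pieces together, a minimal run sketch (assuming a 64-core node without SMT and the ./stream_c.exe binary as above):
export OMP_NUM_THREADS=64 OMP_PROC_BIND=TRUE OMP_PLACES="{0}:64:1"
numactl --localalloc ./stream_c.exe &
hwloc-ps -t    # check that every thread stays bound to its own core
wait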
Update:
numactl impacts the NUMA attributes of all the child processes (since child processes are created using fork on Linux, which copies the NUMA attributes). You can check that with the following (bash) commands:
numactl --show
numactl --physcpubind=2 bash -c "(sleep 1; numactl --show; sleep 2;) & (sleep 2; numactl --show; sleep 1;); wait"
The first command shows the initial NUMA attributes. The second sets them for a child bash process, which launches two processes in parallel; each sub-child process then shows its NUMA attributes. The result is the following on my 1-node machine:
policy: default # Initial NUMA attributes
preferred node: current
physcpubind: 0 1 2 3 4 5
cpubind: 0
nodebind: 0
membind: 0
policy: default # Sub-child 1
preferred node: current
physcpubind: 2
cpubind: 0
nodebind: 0
membind: 0
policy: default # Sub-child 2
preferred node: current
physcpubind: 2
cpubind: 0
nodebind: 0
membind: 0
Here, we can see that the --physcpubind=2 constraint is applied to the two sub-child processes. The same applies to --membind and to the NUMA policy on your multi-node machine. Note also that calling perf before the sub-child processes should not have any impact on the NUMA attributes.
perf should not directly impact the allocations. However, the default policy of the OS may balance the amount of allocated RAM across the NUMA nodes. If that is the case, the page allocations can be unbalanced, decreasing the bandwidth due to the saturation of a NUMA node. The OS memory-page balancing between NUMA nodes is very sensitive: writing a huge file between two benchmark runs can impact the second run, for example (and thus impact the performance of the benchmark). This is why NUMA attributes should always be set manually for HPC benchmarks (as well as process binding).
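Both points can be checked empirically, for example with the following sketch (numastat -p accepts a PID or a process-name pattern):
perf stat -- numactl --show    # the NUMA attributes are unchanged under perf
numastat -p stream_c.exe       # per-NUMA-node allocation of the running benchmark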
PS: I assumed the STREAM benchmark has been compiled with OpenMP support and with optimizations enabled (i.e. -fopenmp -O2 -march=native with GCC and Clang).

Related

Intel OpenMP library slows down memory bandwidth significantly on AMD platforms by setting KMP_AFFINITY=scatter

For memory-bound programs it is not always faster to use many threads (say, as many as there are cores), since the threads may compete for memory channels. Usually, on a two-socket machine, fewer threads are better, but we need to set an affinity policy that distributes the threads across sockets to maximize the memory bandwidth.
Intel OpenMP claims that KMP_AFFINITY=scatter achieves this purpose, while the opposite value "compact" places threads as close as possible. I have used ICC to build the STREAM program for benchmarking, and this claim is easily validated on Intel machines. And if KMP_AFFINITY is set, the native OpenMP env vars like OMP_PLACES and OMP_PROC_BIND are ignored. You will get such a warning:
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
However, a benchmark on a new AMD EPYC machine I obtained shows really bizarre results. KMP_AFFINITY=scatter gives the slowest memory bandwidth possible. It seems that this setting does exactly the opposite on AMD machines: it places threads as close as possible, so that the L3 cache of each NUMA node is not even fully utilized. And if I explicitly set OMP_PROC_BIND=spread, it is ignored by Intel OpenMP, as the warning above says.
The AMD machine has two sockets with 64 physical cores per socket. I have tested using 128, 64, and 32 threads, and I want them to be spread across the whole system. Using OMP_PROC_BIND=spread, STREAM gives me a triad speed of 225, 290, and 300 GB/s, respectively. But once I set KMP_AFFINITY=scatter, even when OMP_PROC_BIND=spread is still present, STREAM gives 264, 144, and 72 GB/s.
Notice that for 128 threads on 128 cores, setting KMP_AFFINITY=scatter gives better performance; this further suggests that all the threads are in fact placed as close as possible, not scattered at all.
In summary, KMP_AFFINITY=scatter displays completely opposite (in the bad way) behavior on AMD machines, and it even overrides the native OpenMP environment variables regardless of the CPU brand. The whole situation sounds a bit fishy, since it is well known that ICC detects the CPU brand and uses the CPU dispatcher in MKL to launch slower code on non-Intel machines. So why can't ICC simply disable KMP_AFFINITY and restore OMP_PROC_BIND if it detects a non-Intel CPU?
Is this a known issue to someone? Or can someone validate my findings?
To give more context, I am a developer of a commercial computational fluid dynamics program, and unfortunately we link our program with the ICC OpenMP library, with KMP_AFFINITY=scatter set by default, because in CFD we must solve large-scale sparse linear systems and this part is extremely memory-bound. I found that with KMP_AFFINITY=scatter set, our program becomes 4X slower (when using 32 threads) than the actual speed the program can achieve on the AMD machine.
Update:
Now, using hwloc-ps, I can confirm that KMP_AFFINITY=scatter is actually doing "compact" on my AMD Threadripper 3 machine. I have attached the lstopo result. I run my CFD program (built by ICC 2017) with 16 threads. OMP_PROC_BIND=spread can place one thread in each CCX so that the L3 cache is fully utilized. hwloc-ps -l -t gives:
While setting KMP_AFFINITY=scatter, I got:
I will try the latest ICC/Clang OpenMP runtime and see how it works.
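For reference, a minimal way to reproduce such a binding check (./cfd_app is a placeholder for whichever OpenMP binary is under test):
OMP_NUM_THREADS=16 OMP_PROC_BIND=spread ./cfd_app &
hwloc-ps -l -t    # list processes and their threads with logical indexes
wait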
TL;DR: Do not use KMP_AFFINITY. It is not portable. Prefer OMP_PROC_BIND (the two cannot be used at the same time). You can combine it with OMP_PLACES to bind threads to cores manually. Moreover, numactl should be used to control memory binding or, more generally, NUMA effects.
Long answer:
Thread binding: OMP_PLACES can be used to bind each thread to a specific core (reducing context switches and NUMA issues). OMP_PROC_BIND and KMP_AFFINITY should theoretically do that correctly, but in practice they fail to do so on some systems. Note that OMP_PROC_BIND and KMP_AFFINITY are exclusive options: they should not be used together (OMP_PROC_BIND is the new, portable replacement for the older KMP_AFFINITY environment variable). As the core topology changes from one machine to another, you can use the hwloc tools to get the list of PU ids required by OMP_PLACES: more specifically, hwloc-calc to compute the list and hwloc-ls to check the CPU topology. All threads should be bound separately so that no migration is possible. You can check the binding of the threads with hwloc-ps.
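As an illustration, a sketch for building an OMP_PLACES list with hwloc and verifying the result (./app is a placeholder):
export OMP_PLACES="{$(hwloc-calc --sep "},{" --intersect PU core:all.pu:0)}"   # first PU of each core
export OMP_PROC_BIND=TRUE
./app &
hwloc-ps -t    # each thread should stay on its own core
wait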
NUMA effects: AMD processors are built by assembling multiple CCXs connected together with a high-bandwidth interconnect (AMD Infinity Fabric). Because of that, AMD processors are NUMA systems. If not taken into account, NUMA effects can result in a significant drop in performance. The numactl tool is designed to control/mitigate NUMA effects: processes can be bound to NUMA nodes (and thus their memory channels) using the --membind option, and the memory allocation policy can be set to --interleave (or --localalloc if the process is NUMA-aware). Ideally, processes/threads should only work on data allocated and first-touched on their local memory channels. If you want to test a configuration on a given CCX, you can play with --physcpubind and --cpunodebind.
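For example, a few hedged command lines illustrating these options (the node and CPU ids are hypothetical and depend on the topology reported by numactl --hardware):
numactl --hardware                             # list NUMA nodes and their CPUs
numactl --interleave=all ./app                 # spread pages over all nodes
numactl --cpunodebind=0 --membind=0 ./app      # run and allocate on node 0 only
numactl --physcpubind=0-7 --localalloc ./app   # pin to CPUs 0-7, allocate locally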
My guess is that the Intel/Clang runtime does not perform a good thread binding when KMP_AFFINITY=scatter is set because of a bad PU mapping (which could come from an OS bug, a runtime bug, or bad user/admin settings), probably due to the CCXs (since mainstream processors containing multiple NUMA nodes used to be quite rare).
On AMD processors, threads accessing the memory of another CCX usually pay a significant additional cost due to data moving through the (rather slow) Infinity Fabric interconnect, and possibly due to its saturation as well as that of the memory channels. I advise you not to trust the OpenMP runtime's automatic thread binding: rather, perform the thread/memory bindings manually (with OMP_PROC_BIND=TRUE) and then report bugs if needed.
Here is an example of a resulting command line to run your application (the OpenMP variables are set before numactl so that they are inherited by ./app):
OMP_PROC_BIND=TRUE OMP_PLACES="{0},{1},{2},{3},{4},{5},{6},{7}" numactl --localalloc ./app
PS: be careful about PU/core IDs and logical/physical IDs.

Methods to decrease Memory consumption in ArangoDB

We currently have an ArangoDB cluster running on version 3.0.10 for a POC, with about 81 GB stored on disk and main memory consumption of about 98 GB distributed across 5 primary DB servers. There are about 200 million vertices and 350 million edges, in 3 edge collections and 3 document collections; most of the memory (80%) is consumed by the edges.
I'm exploring methods to decrease the main memory consumption. I'm wondering if there are any methods to compress/serialize the data so that less amount of main memory is utilized.
The reason for decreasing memory is to reduce infrastructure costs, I'm willing to trade-off on speed for my use case.
Please let me know if there are any methods to reduce the main memory consumption of ArangoDB.
It took us a while to find out that our original recommendation to set vm.overcommit_memory to 2 is not good in all situations.
It seems that there is an issue with the bundled jemalloc memory allocator in ArangoDB in some environments.
With a vm.overcommit_memory kernel setting of 2, the allocator had a problem splitting existing memory mappings, which made the number of memory mappings of an arangod process grow over time. This could lead to the kernel refusing to hand out more memory to the arangod process, even if physical memory was still available. The kernel will only grant up to vm.max_map_count memory mappings to each process, which defaults to 65530 on many Linux environments.
Another issue when running jemalloc with vm.overcommit_memory set to 2 is that for some workloads the amount of memory that the Linux kernel tracks as "committed memory" also grows over time and does not decrease. So eventually the ArangoDB daemon process (arangod) may not get any more memory simply because it reaches the configured overcommit limit (physical RAM * overcommit_ratio + swap space).
So the solution here is to modify the value of vm.overcommit_memory from 2 to either 1 or 0. This will fix both of these problems.
We are still observing ever-increasing virtual memory consumption when using jemalloc with any overcommit setting, but in practice this should not cause problems.
So adjusting the value of vm.overcommit_memory from 2 to either 0 or 1 (0 is the Linux kernel default, by the way) should improve the situation.
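For reference, a sketch of how the relevant kernel settings can be inspected and changed with sysctl (persist them in /etc/sysctl.conf if they help):
sysctl vm.overcommit_memory vm.max_map_count   # show the current values
sudo sysctl -w vm.overcommit_memory=0          # back to the kernel default heuristic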
Another way to address the problem, which however requires compiling ArangoDB from source, is to build it without jemalloc (-DUSE_JEMALLOC=Off when running cmake). I am just listing this as an alternative here for completeness. With the system's libc allocator you should see quite stable memory usage. We also tried another allocator, namely the one from musl libc, and it also shows quite stable memory usage over time. The main issue that makes exchanging the allocator a non-trivial decision is that jemalloc otherwise has very nice performance characteristics.
(quoting Jan Steemann as can be found on github)
Several new additions to the RocksDB storage engine have been made in the meantime. We demonstrate how memory management works in RocksDB.
Many options of the RocksDB storage engine are exposed to the outside via startup options.
Discussions and research by a user led to more of these options being exposed for configuration with ArangoDB 3.7 (a usage sketch follows the list):
--rocksdb.cache-index-and-filter-blocks-with-high-priority
--rocksdb.pin-l0-filter-and-index-blocks-in-cache
--rocksdb.pin-top-level-index-and-filter
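For instance, a hedged sketch of how such options might be passed to the server on startup (they map to boolean RocksDB settings; check the ArangoDB 3.7 documentation for the exact defaults before relying on them):
arangod --rocksdb.cache-index-and-filter-blocks-with-high-priority true \
        --rocksdb.pin-l0-filter-and-index-blocks-in-cache true \
        --rocksdb.pin-top-level-index-and-filter true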

why the memory fragmentation ratio is less than 1 in Redis

Redis supports 3 memory allocators: libc, jemalloc, and tcmalloc. When I do a memory usage test, I find that mem_fragmentation_ratio in INFO MEMORY can be less than 1 with the libc allocator. With jemalloc or tcmalloc, this value is greater than or equal to 1, as it should be.
Could anyone explain why mem_fragmentation_ratio is less than 1 with libc?
Redis version: 2.6.12, CentOS 6.
Update:
I forgot to mention one possible reason: if swapping happens, mem_fragmentation_ratio will be < 1.
But when I did my test, I adjusted swappiness and even turned swap off. The result is the same. And my Redis instance actually does not use much memory.
Generally, you will have less fragmentation with jemalloc or tcmalloc than with libc malloc. This is due to 4 factors:
more granular allocation classes for jemalloc and tcmalloc. This reduces internal fragmentation, especially when Redis has to allocate a lot of very small objects;
better algorithms and data structures to prevent external fragmentation (especially for jemalloc). Obviously, the gain depends on your long-term memory allocation patterns;
support of "malloc size". Some allocators offer an API to return the size of allocated memory. With glibc (Linux), malloc does not have this capability, so it is emulated by explicitly adding an extra prefix to each allocated memory block. This increases internal fragmentation. With jemalloc and tcmalloc (or with the BSD libc malloc), there is no such overhead;
jemalloc (and tcmalloc with some setting changes) can be more aggressive than glibc about releasing memory to the OS, but again, it depends on the allocation patterns.
Now, how is it possible to get inconsistent values for mem_fragmentation_ratio?
As stated in the INFO documentation, the mem_fragmentation_ratio value is calculated as the ratio between memory resident set size of the process (RSS, measured by the OS), and the total number of bytes allocated by Redis using the allocator.
Now, if more memory is allocated with libc (compared to jemalloc or tcmalloc), or if more memory is used by some other processes on your system during your benchmarks, Redis memory may be swapped out by the OS. This reduces the RSS (since part of the Redis memory is no longer in main memory). The resulting fragmentation ratio will be less than 1.
In other words, this ratio is only relevant if you are sure Redis memory has not been swapped out by the OS (if it has, you will have performance issues anyway).
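To see what the ratio is built from, you can read the underlying counters directly (a quick sketch; mem_fragmentation_ratio is roughly used_memory_rss / used_memory, and mem_allocator shows which allocator the build uses):
redis-cli INFO memory | grep -E 'mem_allocator|used_memory:|used_memory_rss:|mem_fragmentation_ratio:'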
Other than swap, I know of 2 ways to make the "memory fragmentation ratio" less than 1:
Have a Redis instance with little or no data, but thousands of idling client connections. From my testing, it looks like Redis has to allocate about 20 KB of memory for each client connection, but most of it won't actually be used (i.e. won't appear in RSS) until later.
Have a master-slave setup with, let's say, 8 GB of repl-backlog-size. The 8 GB will be allocated as soon as the replication starts (on the master only for versions < 4.0, on both master and slave otherwise), but the memory will only be used as we start writing to the master. So the ratio will be way below 1 initially, and then get closer and closer to 1 as the replication backlog gets filled.

Benchmarking processor affinity impact

I'm working on a NUMA architecture, where each compute node has 2 sockets and 4 cores per socket, for a total of 8 cores per compute node, and 24 GB of RAM per node. I have to prove that setting processor affinity can have a significant impact on performance.
Do you have any program to suggest that I could use as a benchmark to show the difference in impact between using processor affinity or not? I could also write a simple C test program, using MPI, OpenMP, or pthreads, but what operation would be best for that test? It must be something that takes advantage of cache locality, but that also triggers context switching (blocking operations) so a process could potentially migrate to another core, or worse, to another socket. It must run on a multiple of 8 cores.
I tried to write a program that benchmarks asymmetry in memory latency on NUMA architecture, and with the help of the StackOverflow community, I succeeded. You can get the program from my StackOverflow post.
Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?
When I run my benchmark program on hardware very similar to yours, I see about a 30% performance penalty when a core is reading/writing to memory that is not in the core's NUMA node (region of affinity). The program has to read and write in a pattern that deliberately defeats caching and pre-fetching, otherwise there's no observable asymmetry.
Try the ASC Sequoia benchmark -- CLOMP -- designed for measuring threading overheads.
You can just use a simple single-threaded process which writes and then repeatedly reads a modest data set. The process needs to run for a lot longer than a single time slice, obviously, and long enough for processes to migrate from one core to another, e.g. 100 seconds.
You can then run two test cases:
run 8 instances of the process without CPU affinity
$ for p in 0 1 2 3 4 5 6 7 ; do time ./my_process & done
run 8 instances of the process with CPU affinity
$ for p in 0 1 2 3 4 5 6 7 ; do time taskset -c $p ./my_process & done
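To confirm where each instance actually runs, you can poll the CPU currently used by every process while the test is in flight (a sketch; the psr column is the processor id):
$ watch -n 5 'ps -eLo pid,psr,comm | grep my_process'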

How will applications be scheduled on hyper-threading enabled multi-core machines?

I'm trying to gain a better understanding of how hyper-threading-enabled multi-core processors work. Let's say I have an app which can be compiled with MPI or OpenMP or MPI+OpenMP. I wonder how it will be scheduled on a CentOS 5.3 box with four Xeon X7560 @ 2.27 GHz processors, where each processor core has Hyper-Threading enabled.
The processors are numbered from 0 to 63 in /proc/cpuinfo. To my understanding, there are FOUR 8-core physical processors, so the total number of PHYSICAL CORES is 32; with Hyper-Threading enabled on each core, the total number of LOGICAL processors is 64.
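For reference, a quick way to map logical processor numbers to sockets and cores is to read /proc/cpuinfo directly (a sketch; the exact numbering depends on the BIOS/kernel):
grep -E 'processor|physical id|core id' /proc/cpuinfo | head -12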
Compiled with MPICH2
How many physical cores will be used if I run with mpirun -np 16? Will the 16 ranks be placed on 16 PHYSICAL cores, or on 16 LOGICAL processors (i.e. 8 PHYSICAL cores using hyper-threading)?
compiled with OpenMP
How many physical cores will be used if I set OMP_NUM_THREADS=16? Will it use 16 LOGICAL processors?
Compiled with MPICH2+OpenMP
How many physical cores will be used if I set OMP_NUM_THREADS=16 and run with mpirun -np 16?
Compiled with OpenMPI
OpenMPI has two runtime options:
-cpu-set, which specifies the logical CPUs allocated to the job;
-cpu-per-proc, which specifies the number of CPUs to use for each process.
If run with mpirun -np 16 -cpu-set 0-15, will it only use 8 PHYSICAL cores?
If run with mpirun -np 16 -cpu-set 0-31 -cpu-per-proc 2, how will it be scheduled?
Thanks
Jerry
I'd expect any sensible scheduler to prefer running threads on different physical processors if possible. Then I'd expect it to prefer different physical cores. Finally, if it must, it would start using the hyperthreaded second thread on each physical core.
Basically when threads have to share processor resources they slow down. So the optimal strategy is usually to minimise the amount of processor resource sharing. This is the right strategy for CPU bound processes and that's normally what an OS assumes it is dealing with.
I would hazard a guess that the scheduler will try to keep the threads of one process on the same physical cores. So if you had sixteen threads, they would be on the smallest number of physical cores. The reason for this would be cache locality: threads from the same process are considered more likely to touch the same memory than threads from different processes. (For example, the cost of cache-line invalidation across cores is high, but that cost does not occur between logical processors in the same core.)
As you can see from the other two answers, the ideal scheduling policy varies depending on what activity the threads are doing.
Threads working on completely different data benefit from more separation. These threads would ideally be scheduled in separate NUMA domains and physical cores.
Threads working on the same data will benefit from cache locality, so the ideal policy is to schedule them close together so they share cache.
Threads that work on the same data and experience a large amount of pipeline stalls benefit from sharing a hyperthread core. Each thread can run until it stalls, at which point the other thread can run. Threads that run without stalls are only hurt by hyperthreading and should be run on different cores.
Making the ideal scheduling decision relies on a lot of data collection and a lot of decision making. A large danger in OS design is to make the thread scheduling too smart. If the OS spends a lot of processor time trying to find the ideal place to run a thread, it's wasting time it could be using to run the thread.
So often it's more efficient to use a simplified thread scheduler and if needed, let the program specify its own policy. This is the thread affinity setting.