I have a set of CPUs allocated for some processes via cgroups. Will these CPUs be accessible to a process that doesn't belong to any cgroup? Might be a stupid question, but I was not able to find the answer by googling.
All processes belong to some cgroup when cgroups are enabled. If you didn't set it explicitly, a process belongs to the root cgroup. You can check it with:
# cat /proc/<pid>/cgroup
CPUs are not exclusively allocated to cgroups by default. When you assign some CPUs to a cgroup (say CPUs 0 and 1 on a 4-CPU machine), processes in that cgroup will only have access to CPUs 0 and 1, but every other cgroup and process can still access all CPUs (0-3).
CPU masks are also hierarchical: you cannot remove a CPU from a parent cgroup while a child cgroup is still using it. Hope that helps.
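As a minimal sketch (assuming cgroup v1 with the cpuset controller mounted at /sys/fs/cgroup/cpuset; mygroup and $PID are placeholders), restricting a process to CPUs 0 and 1 looks like this:
# mkdir /sys/fs/cgroup/cpuset/mygroup
# echo 0-1 > /sys/fs/cgroup/cpuset/mygroup/cpuset.cpus
# echo 0 > /sys/fs/cgroup/cpuset/mygroup/cpuset.mems
# echo $PID > /sys/fs/cgroup/cpuset/mygroup/tasks
# cat /proc/$PID/cgroup
Note that cpuset.mems must be set before tasks can be attached.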
I noticed that numactl has some strange impact on the STREAM benchmark.
More specifically,
"numactl ./stream_c.exe" reports 40% lower memory bandwidth than "./stream_c.exe".
I checked the numactl source code and didn't see anything special it should do when given no parameters. So I would naively expect numactl to have no performance impact in "numactl ./stream_c.exe", but my experiment shows otherwise.
This is a two-socket server with high-core-count processors.
Using numastat, I can see that the numactl command causes the memory allocation to be unbalanced: the two NUMA nodes split the memory allocation 80:20.
Without numactl, the memory is allocated in a much more balanced way: 46:54.
I also found that this is not only a numactl issue: if I use perf to invoke stream_c.exe, the memory allocation is even more unbalanced than with numactl.
So this is more like a kernel question: how do numactl and perf change the memory placement policy for the sub-processes?
Thanks!
TL;DR: The default policy used by numactl can cause performance issues, and so can the lack of OpenMP thread binding. numactl constraints are applied to all (forked) child processes.
Indeed, numactl uses a predefined policy by default. This policy can be --interleave, --preferred, --membind, or --localalloc, and it changes the behavior of the operating-system page allocation when a page is first touched. Here is what the policies mean:
--interleave: memory pages are allocated across the nodes specified by a nodeset, in a round-robin fashion;
--preferred: memory is allocated from a single preferred memory node; if sufficient memory is not available, memory can be allocated from other nodes;
--membind: memory is allocated only from the specified nodes; the allocation fails when there is not enough memory available on these nodes;
--localalloc: memory is always allocated on the current node (the one performing the first touch of the memory page).
In your case, specifying an --interleave or a --localalloc policy should give better performance. I guess that the --localalloc policy should be the best choice if threads are bound to cores (see below).
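For example (a sketch, reusing the stream_c.exe binary from the question):
numactl --interleave=all ./stream_c.exe
numactl --localalloc ./stream_c.exe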
Moreover, the STREAM_ARRAY_SIZE macro is set by default to a value (10 million elements) that is too small to actually measure the performance of the RAM. Indeed, the AMD EPYC 7742 processor has a 256 MiB L3 cache, which is big enough to fit all the data of the benchmark. It is probably much better to compare results with a working set bigger than the L3 cache (e.g. 1024 MiB).
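For example, a possible recompilation sketch (stream.c lets you override the macro from the compiler command line; 134217728 doubles is 1 GiB per array):
gcc -O2 -march=native -fopenmp -DSTREAM_ARRAY_SIZE=134217728 stream.c -o stream_c.exe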
Finally, OpenMP threads may migrate from one NUMA node to another. This can drastically decrease the performance of the benchmark: when a thread moves to another node, the memory pages it accesses are located on a remote node, and the target NUMA node can become saturated. You need to bind OpenMP threads so they cannot move, using for example the following environment variables for this specific processor: OMP_NUM_THREADS=64 OMP_PROC_BIND=TRUE OMP_PLACES="{0}:64:1", assuming SMT is disabled and the core IDs are in the range 0-63. If SMT is enabled, you should tune the OMP_PLACES variable using the command: OMP_PLACES={$(hwloc-calc --sep "},{" --intersect PU core:all.pu:0)} (which requires the hwloc package to be installed on the machine). You can check the thread binding with the command hwloc-ps.
Update:
numactl impacts the NUMA attributes of all child processes (since child processes are created using fork on Linux, which copies the NUMA attributes). You can check that with the following (bash) commands:
numactl --show
numactl --physcpubind=2 bash -c "(sleep 1; numactl --show; sleep 2;) & (sleep 2; numactl --show; sleep 1;); wait"
The first command shows the current NUMA attributes. The second sets them for a child bash process, which launches two processes in parallel; each sub-child process then shows its NUMA attributes. The result is the following on my 1-node machine:
policy: default # Initial NUMA attributes
preferred node: current
physcpubind: 0 1 2 3 4 5
cpubind: 0
nodebind: 0
membind: 0
policy: default # Sub-child 1
preferred node: current
physcpubind: 2
cpubind: 0
nodebind: 0
membind: 0
policy: default # Sub-child 2
preferred node: current
physcpubind: 2
cpubind: 0
nodebind: 0
membind: 0
Here, we can see that the --physcpubind=2 constraint is applied to the two sub-child processes. It should be the same with --membind or with the NUMA policy on your multi-node machine. Thus, note that calling perf before the sub-child processes should not have any impact on the NUMA attributes.
perf should not directly impact the allocations. However, the default policy of the OS could balance the amount of allocated RAM across the NUMA nodes. If this is the case, the page allocations can be unbalanced, decreasing the bandwidth due to the saturation of one NUMA node. The OS memory-page balancing between NUMA nodes is very sensitive: writing a huge file between two benchmark runs can impact the second run, for example. This is why NUMA attributes should always be set manually for HPC benchmarks (as well as process binding).
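If you suspect the kernel's automatic NUMA page balancing, you can query (and, for a controlled run, temporarily disable) it; this sketch assumes your kernel exposes the usual sysctl:
cat /proc/sys/kernel/numa_balancing
echo 0 > /proc/sys/kernel/numa_balancing
(The second command needs root and disables automatic balancing until it is re-enabled.)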
PS: I assumed the STREAM benchmark was compiled with OpenMP support and with optimizations enabled (i.e. -fopenmp -O2 -march=native on GCC and Clang).
There is LimitAS, which can limit virtual memory (equivalent to ulimit -v), but there is also MemoryLimit (obsoleted by MemoryMax in newer versions). What's the difference between them? Do they serve the same purpose?
LimitAS and the other limits in systemd.exec(5) correspond to ulimit, i.e. the setrlimit system call, and are per-process: a process can evade such a limit by forking child processes (the children each inherit the limit, but their memory usage is counted separately). MemoryLimit and the other limits in systemd.resource-control(5) correspond to cgroup limits and apply to all processes in the control group collectively, which a process cannot escape. You almost certainly want to use those.
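For example, a minimal sketch of a drop-in for a hypothetical unit myservice.service (MemoryMax= applies on the unified cgroup v2 hierarchy; on cgroup v1 use MemoryLimit=):
# /etc/systemd/system/myservice.service.d/override.conf
[Service]
MemoryMax=512M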
I have created a VM with 2 GB RAM. I also created a zram device with a disksize of 1 GB and configured it as a swap device.
# sudo modprobe zram num_devices=1
# zramctl --find --size 1024M
# mkswap /dev/zram0
# swapon /dev/zram0
So I have 1 GB RAM and 1 GB Swap space (compressed RAM) in my new VM.
Now my question is, what happens when I issue following command?
zramctl --find --size 1024M
How does it tell the kernel that it only has 1 GB of the total 2 GB for normal page allocation and the rest is for zram (the RAM block device)? It doesn't allocate 1 GB of RAM for the zram block device at this point, correct?
I looked at the zram kernel driver implementation. Whenever it tries to add a page to the zram device, it allocates memory (0-order pages) using alloc_pages(gfp_mask, 0) and links them to create a so-called zspage, and it uses this zspage to store the compressed page.
So this means zram is not allocating 1 GB of RAM during driver initialization, correct? It allocates dynamically as needed.
My question is: when I spawn a process that needs more than 1 GB of memory, it should first use all the available RAM and then use the zram swap. How does zram tell the kernel that the kernel can only use 1 GB and the rest of the RAM (1 GB) belongs to zram?
It took me a while to find info on how zRam actually behaves and functions, but I finally found my answer.
When you initialize a zRam device (/dev/zramX) no RAM is actually allocated, and your system will use physical RAM as normal until swapping occurs. When this happens, the zRam device starts to use physical memory to store compressed pages, up to a total amount of uncompressed data equal to the initialized size, which means that the space actually used by zRam is less than the amount specified by your command zramctl --find --size 1024M.
There are hooks under /sys (and possibly /proc) to query the current state of your zRam disks, but since I just started testing/playing with it this week, I'm not too familiar with them.
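For instance, zramctl itself reports compressed vs. uncompressed sizes, and recent kernels expose per-device statistics under /sys/block/zramN (the field layout varies by kernel version):
zramctl
cat /sys/block/zram0/mm_stat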
Source: some guy's personal wiki?
As an abstract concept in parallel computing, local (shared) memory is allocated per thread block (CUDA) / workgroup (OpenCL) and shared between all threads in the same thread block / workgroup.
How is it actually allocated? Is it allocated by the first thread of the block/group, or is it allocated by the memory controller before the blocks are created? Or something else?
What OpenCL considers "Local Memory" is:
Memory available only during kernel execution, shared only by work-items of the same workgroup. Each workgroup can only see its own local memory.
The memory usage is known at compile time and limited.
It is very similar to registers or L1/L2 cache in CPUs / multicore systems. Compilers know about the registers of the target CPU and plan accordingly.
When the scheduler assigns workgroups to hardware resources, it will always ensure that enough local memory is available for each workgroup.
You can consider local memory inside kernel execution as a pointer to memory that is already allocated, similar to a register or private memory.
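A minimal OpenCL C sketch of both ways a kernel can obtain local memory (the names and sizes are illustrative, not from the question):
// Statically sized local memory: the size is known at compile time,
// so the implementation reserves it once per workgroup.
__kernel void reduce_static(__global const float *in, __global float *out) {
    __local float tile[64];
    size_t lid = get_local_id(0);
    tile[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);  // every work-item sees the same tile
    if (lid == 0)
        out[get_group_id(0)] = tile[0];
}

// Dynamically sized local memory: the host passes only a size, no data,
// e.g. clSetKernelArg(kernel, 2, local_bytes, NULL);
__kernel void reduce_dynamic(__global const float *in, __global float *out,
                             __local float *tile) {
    size_t lid = get_local_id(0);
    tile[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0)
        out[get_group_id(0)] = tile[0];
}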
I have a query related to memory leak.
A 32-bit Linux based system is running multiple active processes A, B, C, D. All the processes are allocating/deallocating memory from the heap. Now if process A is continuously leaking a significant amount of memory, could it happen that after a certain amount of time process B can't find any memory to allocate from the heap?
As per my understanding, each process is provided with a unique VM of 2 GB by the OS. But there is a mapping between the VM and the physical memory.
Yes, if the total amount of VM (RAM + swap space) is exhausted by process A, then malloc in any of the other processes might fail because of that. Linux hides processes' memory spaces from other processes, but it doesn't magically create extra memory in your machine. (Although it may seem to do so due to its overcommit behavior.)
In addition, Linux may employ its OOM killer when memory is running low.
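A quick way to see whether the OOM killer fired (the message wording varies by kernel version):
dmesg | grep -i oom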
The Linux kernel does memory overcommit by default.
When a process requests a memory segment with malloc(), the memory is not immediately allocated.
You may have 4 processes malloc()ing 2 GB each without any problem.
The problem arises when the processes actually make use of (initialize, bzero, copy into) the malloc()ed memory.
You may even malloc() more memory than the system can actually reserve for you, without any problem, and malloc() doesn't even return NULL!
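A small C sketch illustrating this (assuming a 64-bit build; whether the malloc() call itself succeeds depends on the setting in /proc/sys/vm/overcommit_memory):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Ask for far more memory than this machine likely has. */
    size_t size = (size_t)16 * 1024 * 1024 * 1024;  /* 16 GiB */
    char *p = malloc(size);
    if (p == NULL) {  /* can still fail, e.g. under strict overcommit */
        perror("malloc");
        return 1;
    }
    puts("malloc() succeeded; only address space was reserved so far");
    /* First touch forces real page allocation; on an overcommitted
       system this is where the OOM killer may strike. */
    memset(p, 1, size);
    free(p);
    return 0;
}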