Which factor determines the number of NUMA nodes? - cpu

SITUATION
I am running a Virtual Machine (CentOS 7) on an ESXi 6.5 host.
BIOS (NUMA topology): enabled
OS (NUMA topology): enabled
Virtual Machine OS: CentOS 7
Host OS: ESXi 6.5
vCPU: 56
Sockets: 56
No matter how I change the number of vCPUs or the number of sockets, there is always ONE NUMA NODE on the VM. From reading the blog NUMA And vNUMA – Back To The Basic, I found that the number of NUMA nodes is the same whether I assign cores or sockets.
I also found some advanced settings in the ESXi 6.5 documentation, but I do NOT know how to use the following settings appropriately.
cpuid.coresPerSocket
numa.vcpu.maxPerVirtualNode
numa.autosize.once
numa.vcpu.min
numa.vcpu.followcorespersocket
WANT
What I want is for the VM to have two NUMA nodes, or to be able to control the number of NUMA nodes.
QUESTION
Which factor determines the number of NUMA nodes?
How do I modify the number of NUMA nodes (detailed steps, please)?

Which factor determines the number of NUMA nodes?
From the perspective of the hardware, the physical layout of processors (cores) and main memory modules is what determines the number of NUMA nodes in a system. A NUMA node consists of a collection of cores, some supporting logic and memory units, and a set of memory modules that these cores can access with much lower latency than the other memory modules.
But from the perspective of the OS: if the OS is not NUMA-aware, by default in most (all?) systems the BIOS will configure the system so that the physical address space is interleaved across the physical NUMA nodes. So from that perspective, the whole system would look like a single NUMA node, even though there are multiple NUMA nodes physically. Even for a NUMA-aware OS, it's usually possible to enable node interleaving from the BIOS or the OS; doing that causes the whole system to be treated as a single NUMA node. In addition, if the OS is running in a VM, the VM itself must be configured to expose NUMA.
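As a quick sanity check, you can inspect the topology the guest OS actually sees from inside a Linux guest (numactl may need to be installed first, e.g. yum install numactl on CentOS 7):

    numactl --hardware     # lists each NUMA node with its CPUs and memory
    lscpu | grep -i numa   # e.g. "NUMA node(s): 1"

If both report a single node while the physical host has several, the hypervisor or BIOS is hiding (or interleaving) the physical topology.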
How do I modify the number of NUMA nodes (detailed steps, please)?
You'll have to ensure that ESXi 6.5 is configured to expose a virtual NUMA topology to the guest operating system, so you most probably have a configuration issue. See this and this.
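As a sketch of what such a configuration could look like for the 56-vCPU VM above, assuming the usual defaults (numa.vcpu.min = 9 and numa.vcpu.maxPerVirtualNode = 8; check the documentation for your exact build), you could add the following to the .vmx file while the VM is powered off, or via Edit Settings > VM Options > Advanced > Configuration Parameters:

    numa.vcpu.maxPerVirtualNode = "28"   # 56 vCPUs / 28 per node = 2 vNUMA nodes
    numa.autosize.once = "FALSE"         # recompute the vNUMA topology on each power-on

After powering the VM back on, re-check inside the guest with numactl --hardware as above. Note that in ESXi 6.5 the vNUMA topology is no longer derived from cpuid.coresPerSocket unless numa.vcpu.followcorespersocket is set, which would explain why changing the socket count alone did not change the node count.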

Related

Is it possible for a vCPU to use CPUs from two different hardware computers

I've searched about this, but I don't seem to get a fair answer.
Let's say I want to create a VM that has a vCPU, and that vCPU must have 10 cores, but I only have 2 computers with 5 cores of physical CPU each.
Is it possible to create one vCPU by relying on these two physical CPUs so that it performs like one regular physical CPU?
Update 1: Let's say I'm using VirtualBox; the term vCPU refers to a virtual CPU, and it's a well-known term.
Update 2: I'm asking this because I'm doing a little research about dynamic provisioning in HPC clusters, and I want to know if the word "dynamic" really means allocating virtual CPUs dynamically from different hardware, like bare-metal servers. I don't know if I was searching in the wrong place, but no one really answers this question in the docs.
Unfortunately, I have to start by saying that I completely disagree with the answer from OSGX (and I have to start with that, as the rest of my answer depends on it). There are documented cases where aggregating the CPU power of multiple physical systems into a single system image works great. As for the comment regarding ScaleMP, "...solutions can be ranged from "make target application slower" to "make target application very-very slow"...", all one needs to do to invalidate that claim is to check the top-rated machines in the SPEC CPU benchmark lists: machines using ScaleMP are among the top 5 SMPs ever built for performance on this benchmark.
Also, from a computer architecture perspective, all large-scale machines are essentially a collection of smaller machines with a special fabric (Xbar, NUMAlink, etc.) and some logic/chipset to manage cache coherence. Today's standard fabrics (PCIe switching, InfiniBand) are just as fast, if not faster, than those proprietary SMP interconnects. Would OSGX claim those SMPs are also "very-very-slow"?
The real question, as with any technology, is what you are trying to achieve. Most technologies are a good fit for one task but not another. If you are trying to build a large machine (say, combine 16 servers, each with 24 cores, into a 384-core SMP) on top of which you will be running small VMs, each using a single-digit number of vCPUs, then this kind of SSI solution would probably work very nicely, since to the underlying infrastructure you are merely running a high-throughput computing (HTC) workload - just like SPEC CPU is. However, if you are running thread-parallel software that makes heavy use of serializing constructs (barriers, locks, etc.) requiring intensive communication between all cores, then maybe you won't see any benefit.
As to the original question in the thread, or rather, the "Update 2" by the author:
...I'm asking this because I'm doing a little research about dynamic provisioning in HPC clusters...
Indeed, there is not a lot of technology out there that enables the creation of a single system from CPUs across a cluster. The technology mentioned earlier, from ScaleMP, does this, but only at physical-server granularity: if you have a cluster of 100 servers and each cluster node has 24 cores, then you can "dynamically" create virtual machines of 48 cores (2 cluster nodes), 72 cores (3 cluster nodes), and so on, but you could not create a machine with 36 cores (1.5 cluster nodes), nor combine a few vacant CPUs from across different nodes - you either use all the cores of a node in the virtual SMP, or none at all.
I'll use the term vCPU for virtual cores and pCPU for physical cores, as defined by the VirtualBox documentation: https://www.virtualbox.org/manual/ch03.html#settings-processor
On the "Processor" tab, you can set how many virtual CPU cores the guest operating systems should see. Starting with version 3.0, VirtualBox supports symmetrical multiprocessing (SMP) and can present up to 32 virtual CPU cores to each virtual machine. You should not, however, configure virtual machines to use more CPU cores than you have available physically (real cores, no hyperthreads).
And I will try to answer your questions:
let's say I want to create a VM that has a vCPU, and that vCPU must have 10 cores, but I only have 2 computers with 5 cores of physical CPU each.
If you want to create a virtual machine (a single OS image, an SMP machine), all virtual cores must share memory. Two physical machines with 5 cores each have 10 cores in total, but they have no shared memory. So, with classic virtualization software (QEMU, KVM, Xen, VMware, VirtualBox, Virtual PC) you are not able to combine two physical machines into a single virtual machine.
is it possible to create that vCPU by relying on these two physical CPUs to perform like one regular physical CPU?
No.
A regular physical machine has one or more CPU chips (sockets), and each chip has one or more cores. The first PC had 1 chip with one core; there were servers with two sockets and one core in each. Later, multicore chips were made, and huge servers may have 2, 4, 6 or sometimes even 8 sockets, with some number of cores per socket. A physical machine also has RAM - dynamic memory used to store data. Earlier multisocket systems had a single memory controller; current multisocket systems have several memory controllers (MCs, 1-2 per socket, each controller with 1, 2, or sometimes 3 or 4 memory channels). Both multicore and multisocket systems allow any CPU core to access any memory, even if it is controlled by the MC of another socket. And all accesses to system memory are coherent (memory coherence, cache coherence): any core may write to memory, and any other core will see the writes of the first core in some defined order (according to the consistency model of the system). This is shared memory.
"two physical" chips of two different machines (your PC and your laptop) have not connected their RAM together and don't implement in hardware any model of memory sharing and coherency. Two different computers interacts using networks (Ethernet, Wifi, .. which just sends packets) or files (store file on USB drive, disconnect from PC, connect to laptop, get the file). Both network and file sharing are not coherent and are not shared memory
i'm using virtualBox
With VirtualBox (and some other virtualization solutions) you may allocate 8 virtual cores for a virtual machine even when your physical machine has 4 cores. But the VMM will just emulate 8 cores, scheduling them one after another on the available physical cores, so at any time only programs from 4 virtual cores will run on physical cores (https://forums.virtualbox.org/viewtopic.php?f=1&t=30404: "core i7, this is a 4 core .. I can use up to 16 VCPU on virtual Machine .. Yes, it means your host cores will be over-committed. .. The total load of all guest VCPUs will be split among the real CPUs."). In this case you will be able to start a 10-core virtual machine on a 5-core physical machine, and an application that wants 10 cores will get them. But the performance of the application will be no better than with 5 real CPUs, and it will be worse, because "virtual CPU switching" and frequent synchronization add extra overhead.
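For illustration, over-committing looks like this with VirtualBox's command line (the VM name "myvm" is a placeholder):

    VBoxManage modifyvm "myvm" --cpus 10   # give the VM 10 vCPUs on a 5-core host
    VBoxManage startvm "myvm"
    # The guest now sees 10 cores, but they are time-sliced onto the 5
    # physical cores, so a fully parallel workload runs slower, not faster.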
Update 2: I'm asking this because I'm doing a little research about dynamic provisioning
If you want to research "dynamic provisioning", ask about that, not about "running something unknown on two PCs at the same time".
in HPC clusters,
There is no single type of "HPC" or "HPC cluster". Different variants of HPC require different solutions and implementations. Some HPC tasks need huge amounts of memory (0.25, 0.5, 1, 2 TB) and will only run on shared-memory 4- or 8-socket machines filled with the largest memory DIMM modules. Other HPC tasks make heavy use of GPGPUs. A third kind combines thread parallelism (OpenMP) and process parallelism (MPI): applications use shared memory while their threads run on a single machine, and send and receive packets over the network to work collectively on one task across several (or thousands of) physical machines. A fourth kind of HPC may want 100 or 1000 TB of shared memory; there are no SMP/NUMA machines with such capacities, so the application can be written in the distributed shared memory paradigm (distributed global address space, DGAS; partitioned global address space, PGAS) to run on special machines or on huge clusters. Special solutions are used, and in PGAS a global shared memory of hundreds of TB is emulated across many computers connected by a network. The program is written in a special language or just uses special library functions to access memory (a list of variants from Wikipedia: PGAS - "Unified Parallel C, Coarray Fortran, Split-C, Fortress, Chapel, X10, UPC++, Global Arrays, DASH and SHMEM"). If the address in a request is in local memory, it is used directly; if it is in the memory of another machine, a packet is sent to that machine to request the data. Even with the fastest (100 Gbit/s) special networks with RDMA capability (the network adapter may access the memory of the PC without any additional software processing of incoming packets), the difference between local memory and the memory of a remote computer is speed: you have higher access latency and lower bandwidth when the memory is remote (remote memory is slower than local memory).
If you say "vCPU must have 10 cores" we can read this as "there is application which want 10 core of shared memory system". In theory it is possible to emulate shared memory for application (and it can be possible to create virtualization solution which will use resources from several PC to create single virtual pc with more resources), but in practice this is very complex task and the result probably will has too low performance. There is commercial ScaleMP (very high cost; Wikipedia: ScaleMP "The ScaleMP hypervisor combines x86 servers to create a virtual symmetric multiprocessing system. The process is a type of hardware virtualization called virtualization for aggregation.") and there was commercial Cluster OpenMP from Intel (https://software.intel.com/sites/default/files/1b/1f/6330, https://www.hpcwire.com/2006/05/19/openmp_on_clusters-1/) to convert OpenMP programs (uses threads and shared memory) into MPI-like software with help of library and OS-based handlers of access to remote memory. Both solutions can be ranged from "make target application slower" to "make target application very-very slow" (internet search of scalemp+slow and cluster+openmp+slow), as computer network is always slower that computer memory (network has greater distance than memory - 100m vs 0.2m, network has narrow bus of 2, 4 or 8 high-speed pairs while memory has 64-72 high-speed pairs for every memory channel; network adapter will use external bus of CPU when memory is on internal interface, most data from network must be copied to the memory to become available to CPU).
and I want to know if the word "dynamic" really means
no one really answers this question in the docs.
If you want help from other people, show us the context or the docs you have for the task. It may also be useful for you to better understand some basic concepts of computing and of cluster computing (did you take any CS/HPC courses?).
There are some results from internet searches like "dynamic provisioning in HPC clusters", but we can't tell whether they match the HPC variant you want.

Is DMA aware of NUMA nodes?

Assume that we have 2 physical processors in 2 sockets, forming 2 NUMA nodes.
We also have 2 PCIe devices connected to the system through a DMA controller.
What does it mean when we say "the local PCIe device"? Is the read/write speed different when a PCIe device writes to different NUMA nodes?
My answer to Is CPU access asymmetric to Network card pretty much answers your question.
A PCIe device is connected directly to one NUMA node, and is therefore called a "local PCIe device" from that node's point of view.
Yes, there is a speed difference, since accesses to the other node have to cross the inter-node interconnect.
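On Linux you can see which node a given PCIe device is local to via sysfs (the bus address 0000:3b:00.0 below is just an example; find yours with lspci):

    lspci | grep -i ethernet                          # find the device's bus address
    cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
    # prints the node ID (0, 1, ...); -1 means the kernel has no NUMA
    # information for this device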

Ambari YARN container settings

I'm interested in the CPU settings in Ambari; concretely, I see CPU options such as:
Percentage of physical CPU allocated for all containers on a node
Number of virtual cores
And per container:
Minimum Container Size (VCores)
Maximum Container Size (VCores)
I saw similar settings regarding the RAM and I was able to find some recommendations about it, but I found none for the case of CPU.
Concretely, I'm interested in whether I should reserve a number of VCores for the system (as in the case of memory), or whether I should use them all for containers. That is, should the Number of virtual cores be set to the maximum value or not? And what should I use as the Percentage?
I would suggest that you keep a minimum of 2 cores per node for the operating system, and a minimum of 2 GB of memory for the same. The rest you can safely utilise for your applications launched on YARN. That being said, you can use it all, but that might choke your system if you run extremely CPU-intensive jobs.
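For reference, the Ambari fields above should map to the following yarn-site.xml properties (the mapping and the values are an illustration for a hypothetical 16-core node reserving 2 cores for the OS; verify against your Ambari version):

    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>14</value>  <!-- "Number of virtual cores" -->
    </property>
    <property>
      <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
      <value>80</value>  <!-- "Percentage of physical CPU allocated for all containers" -->
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-vcores</name>
      <value>1</value>   <!-- "Minimum Container Size (VCores)" -->
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-vcores</name>
      <value>4</value>   <!-- "Maximum Container Size (VCores)" -->
    </property>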

Is CPU access asymmetric to Network card

When we have 2 CPUs in a machine, do they have symmetric access to network cards (PCIe)?
Essentially, for packet-processing code handling 14M packets per second from a network card, does it matter which CPU it runs on?
Not sure if you still need an answer, but I will post one anyway in case someone else might need it. And I assume you are asking about hardware topology rather than OS IRQ affinity problems.
The comment from Jerry is not 100% correct. While NUMA is a form of SMP, access to memory and PCIe resources from different NUMA nodes is not symmetric. "Symmetric" here is the opposite of the master-slave AMP architecture; it says nothing about resource access.
NICs are typically attached to the CPU via a PCIe link (I assume you are talking about Ethernet/IP, not some HPC interconnect like InfiniBand). PCIe links originate at the CPU. For example, the Intel® Xeon® Processor E5-2699 v4 has 30 PCIe v3.0 lanes, and an Intel X520 QDA-1 10GbE NIC needs 4 or 8 PCIe v3.0 lanes to connect to the CPU.
A NIC can't be connected to two CPUs at the same time, as the PCIe link goes directly into one CPU. It depends on the motherboard's layout which physical PCIe slot connects to which CPU socket, and it can't easily be switched since it's hardwired. The PCIe topology information should be in the datasheet, or printed on the motherboard next to the PCIe slot (e.g. CPU1_PCIE8, CPU2_PCIE4).
https://www.asus.com/us/Commercial-Servers-Workstations/ESC4000_G3S/specifications/
http://www.intel.com/content/www/us/en/embedded/products/grantley/specifications.html
Accessing a NIC in the same NUMA domain is faster than across NUMA domains. Some performance numbers for reference can be found at http://docplayer.net/5271505-Network-function-virtualization-virtualized-bras-with-linux-and-intel-architecture.html (Figures 12-16).
In summary, for best performance, always run the code that talks to a NIC on cores within the same NUMA node as that NIC, if possible.
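For example, on Linux you can look up the NIC's node and pin your application to it (eth0 and ./pktapp are placeholders for your interface and packet-processing binary):

    cat /sys/class/net/eth0/device/numa_node   # e.g. prints 0
    numactl --cpunodebind=0 --membind=0 ./pktapp
    # runs the app only on node 0's cores and allocates its memory from
    # node 0, i.e. the node the NIC is attached to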

Hadoop: Which configuration is good

What is better as a Hadoop configuration:
a large number of small machines, each with 512 MB RAM, or a small number of large machines (something like 2 GB or 4 GB RAM)?
I can choose either of the two, as my nodes would be VMs.
Please share your thoughts.
The bottlenecks are very dependent on the type of application you run. But in general, I would (IMHO) say that your memory assumptions are off. You should get fewer, faster mainstream machines. How each machine is configured depends on its role, but there is no way that a large number of 512 MB VMs would match even a few 12-24 GB mainstream servers with good networking, CPU, and disk.
Standard high-volume equipment is the way to go, but in practice that translates into this:
First get efficient performance per dollar per machine before you scale "sideways". Scaling "sideways" with underpowered machines quickly becomes much more expensive.
A cluster of inexpensive machines does not really mean "any machine" (contrary to some popular belief). The overhead of each node is really big, so adding memory, disk space, disk throughput, and CPU is generally more efficient than adding the next node. This is of course only true up to the point where you are still in the "high-volume hardware" category (mainstream fast servers). The last mile in clock frequency, memory, and disk should be avoided.
So to answer your question, go for a few machines with Gigabit Ethernet, 12 GB of RAM, a fast CPU, and big, fast disks. Make sure that all machines are on a Gigabit switch.
BTW, many people recommend dual-socket machines, Xeon CPUs, RAIDed disks, and 24 GB of RAM, and argue that this gives the best performance per dollar for Hadoop.
