relation of CPU speed and NIC throughput - performance

I'm testing the performance of a DPDK-based Open vSwitch implementation (github.com/01org/dpdk-ovs) on the following server:
Intel Xeon E3 CPU @ 3.30 GHz
Intel I210 1G NIC
8 GB RAM
Basically my setup includes two ports: traffic enters port0 and is forwarded by DPDK to port1. The performance is quite low, although I isolated the DPDK processes on distinct cores of the machine. I didn't do IRQ affinitization because DPDK uses poll-mode user-space drivers.
Now I'm beginning to wonder whether the CPU speed of the server may have an impact on the overall performance. With regard to NIC speed and packet-processing performance, is it ever normal to slow down the CPU, i.e. drop the frequency, in order to achieve better performance, or does that sound absurd?
Thanks.

It's very unlikely that the CPU's clock speed is what limits packet processing relative to the NIC's capability. Ideally, when the NIC works at full capacity, moving packet buffers from NIC memory to system memory is done by DMA and requires no CPU cycles. CPU cycles are needed when the DPDK PMD calls rte_eth_rx_burst()/rte_eth_tx_burst() to receive or transmit, whether on two different physical cores (as in your case) or on hyper-threaded lcores. Hence, to process at full NIC capacity you might need additional cores for better performance, but slowing down the CPU does not help.
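For reference, here is a minimal sketch of the kind of polled forwarding loop the answer describes, assuming port0/port1 are already configured (rte_eal_init(), rte_eth_dev_configure(), queue setup and rte_eth_dev_start() are omitted); the burst size of 32 is only an illustrative value:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Poll port 0 and forward everything to port 1 on one lcore. */
static void forward_port0_to_port1(void)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Polling costs CPU cycles on this core even when no packets arrive. */
        uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* The NIC DMAs the mbufs out; the CPU only hands over descriptors. */
        uint16_t nb_tx = rte_eth_tx_burst(1, 0, bufs, nb_rx);

        /* Free anything the TX queue could not accept. */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}

If a single core running this loop cannot keep up with the offered load, the answer's point applies: spread the receive queues over additional lcores rather than changing the CPU frequency.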

Related

Explanation for why effective DRAM bandwidth reduces upon adding CPUs

This question is a spin-off of the one posted here: Measuring bandwidth on a ccNUMA system
I've written a micro-benchmark for the memory bandwidth on a ccNUMA system with 2x Intel(R) Xeon(R) Platinum 8168:
24 cores @ 2.70 GHz,
L1 cache 32 kB, L2 cache 1 MB and L3 cache 33 MB.
As a reference, I'm using Intel Advisor's roofline plot, which depicts the bandwidth of each available CPU data path. According to this, the bandwidth is 230 GB/s.
(Figure: strong scaling of bandwidth.)
Question: If you look at the strong scaling diagram, you can see that the peak effective bandwidth is actually achieved at 33 CPUs, following which adding CPUs only reduces it. Why is this happening?
Overview
This answer provides probable explanations. Put shortly, no parallel workload scales infinitely. When many cores compete for the same shared resource (e.g. DRAM), using too many cores is often detrimental: there is a point where enough cores saturate a given shared resource, and adding more cores only increases the overhead.
More specifically, in your case, the L3 cache and the IMCs are likely the problem. Enabling Sub-NUMA Clustering and using non-temporal prefetches should somewhat improve the performance and scalability of your benchmark. Still, there are other architectural hardware limitations that can cause the benchmark not to scale well. The next section describes how Intel Skylake SP processors deal with memory accesses and how to find the bottlenecks.
Under the hood
In your case, the layout of the Intel Xeon Skylake SP processors is as follows:
(Figure: two-socket Skylake SP layout. Source: Intel)
There are two sockets connected with a UPI interconnect, and each processor is connected to its own set of DRAM. There are 2 Integrated Memory Controllers (IMCs) per processor, each connected to 3 DDR4 channels @ 2666 MHz. This means the theoretical bandwidth is 2*2*3*2666e6*8 = 256 GB/s = 238 GiB/s.
Assuming your benchmark is well designed and each processor accesses only its own NUMA node, I would expect very low UPI throughput and a very low number of remote NUMA pages. You can check this with hardware counters; Linux perf or VTune let you do so relatively easily.
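As a complementary check (my suggestion, not something mentioned in the question), you can also query where the benchmark's pages actually live with move_pages() in query mode (nodes == NULL); link with -lnuma. A minimal sketch:

#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Print the NUMA node backing the page that contains 'buf'.
 * The page must already have been touched, otherwise status is -ENOENT. */
static void print_page_node(void *buf)
{
    uintptr_t mask = ~(uintptr_t)(sysconf(_SC_PAGESIZE) - 1);
    void *page[1] = { (void *)((uintptr_t)buf & mask) };
    int status[1];

    if (move_pages(0 /* calling process */, 1, page, NULL, status, 0) == 0)
        printf("%p is on NUMA node %d\n", buf, status[0]);
    else
        perror("move_pages");
}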
The L3 cache is split into slices. All physical addresses are distributed across the cache slices using a hash function (see here for more information). This method enables the processor to balance the throughput between all the L3 slices. It also enables the processor to balance the throughput between the two IMCs, so that in the end the processor looks like an SMP architecture rather than a NUMA one. The same approach was used in Sandy Bridge and Xeon Phi processors (mainly to mitigate NUMA effects).
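The real slice-selection hash is undocumented (as noted below), but as a purely illustrative toy, mapping a physical address to a slice could look like XOR-folding the cache-line index; this is not Intel's actual function:

#include <stdint.h>

/* Toy illustration only: fold line-index bits together and reduce modulo the
 * slice count, so consecutive lines spread roughly evenly across slices. */
static unsigned l3_slice_toy(uint64_t phys_addr, unsigned n_slices)
{
    uint64_t line = phys_addr >> 6;   /* 64-byte cache-line index */
    uint64_t h = line ^ (line >> 12) ^ (line >> 24);
    return (unsigned)(h % n_slices);
}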
Hashing does not guarantee perfect balancing, though (no hash function is perfect, especially the ones that are fast to compute), but it is often quite good in practice, especially for contiguous accesses. Bad balancing decreases memory throughput due to partial stalls. This is one reason you cannot reach the theoretical bandwidth.
With a good hash function, the balancing should be independent of the number of cores used. If the hash function is not good enough, one IMC can be more saturated than the other, oscillating over time. The bad news is that the hash function is undocumented and checking this behaviour is complex: AFAIK you can get hardware counters for each IMC's throughput, but their granularity is quite coarse. On my Skylake machine the hardware counters are named uncore_imc/data_reads/ and uncore_imc/data_writes/, but on your platform you should have 4 such counters (one for each IMC).
Fortunately, Intel provides a feature called Sub-NUMA Clustering (SNC) on Xeon SP processors like yours. The idea is to split each processor into two NUMA nodes that have their own dedicated IMC. This solves the balancing issue caused by the hash function and so results in faster memory operations, as long as your application is NUMA-friendly. Otherwise, it can actually be significantly slower due to NUMA effects. In the worst case, the pages of an application can all be mapped to the same NUMA node, resulting in only half the bandwidth being usable. Since your benchmark is supposed to be NUMA-friendly, SNC should make it more efficient.
(Figure: Sub-NUMA Clustering configuration. Source: Intel)
Furthermore, having more cores accessing the L3 in parallel can cause more early evictions of prefetched cache lines, which then need to be fetched again later when the core actually needs them (with an additional DRAM latency to pay). This effect is not as unusual as it seems. Indeed, due to the high latency of DDR4 DRAM, hardware prefetching units have to prefetch data a long time in advance to reduce the impact of that latency, and they need to keep many requests in flight concurrently. This is generally not a problem with sequential accesses, but more cores make the accesses look more random from the caches' and IMCs' point of view. The thing is that DRAM is designed so that contiguous accesses are faster than random ones (multiple contiguous cache lines should be loaded consecutively to fully saturate the bandwidth).

You can analyse the value of the LLC-load-misses hardware counter to check whether more data is re-fetched with more threads (I see such an effect on my Skylake-based PC with only 6 cores, but it is not strong enough to cause any visible impact on the final throughput).

To mitigate this problem, you can use software non-temporal prefetches (prefetchnta) to ask the processor to load data directly into line fill buffers instead of the L3 cache, resulting in lower cache pollution (here is a related answer). This may be slower with few cores due to lower concurrency, but it should be a bit faster with a lot of cores. Note that this does not solve the problem of fetched addresses looking more random from the IMCs' point of view, and there is not much to do about that.
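As a rough sketch of the non-temporal prefetch idea, using the _mm_prefetch intrinsic with _MM_HINT_NTA (which compiles to prefetchnta); the prefetch distance of 8 lines is an arbitrary illustrative value, not a tuned one:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define PF_DIST 8   /* prefetch distance in cache lines; assumption, tune it */

/* Sum an array while issuing non-temporal prefetches ahead of the loads,
 * reducing L3 pollution compared to letting the lines land in the cache. */
static uint64_t sum_with_nta_prefetch(const uint64_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* 8 uint64_t per 64-byte line; prefetching past the end is harmless. */
        _mm_prefetch((const char *)&data[i + 8 * PF_DIST], _MM_HINT_NTA);
        sum += data[i];
    }
    return sum;
}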
The low-level architecture of DRAM and caches is very complex in practice. More information about memory can be found in the following links:
What Every Programmer Should Know About Memory
Introduction to High Performance Scientific Computing (Section 1.3)
Lecture: Main Memory and the DRAM System
Short lectures: Dynamic Random Access Memory (in 7 parts)
Intel® 64 and IA-32 Architectures Software Developer's Manual (Volume 3)

Is CPU access asymmetric to Network card

When we have 2 CPUs on a machine, do they have symmetric access to network cards (PCI)?
Essentially, for packet-processing code handling 14M packets per second from a network card, does it matter which CPU it runs on?
Not sure if you still need an answer, but I will post one anyway in case someone else needs it. I assume you are asking about hardware topology rather than OS IRQ affinity issues.
The comment from Jerry is not 100% correct. While NUMA is SMP, access to memory and PCIe resources from different NUMA nodes is not symmetric. It is "symmetric" as opposed to a master-slave AMP architecture, not in terms of resource access.
NICs are typically attached to a CPU via a PCIe link (I assume you are talking about Ethernet/IP, not an HPC interconnect like InfiniBand). PCIe links are rooted at the CPU. For example, the Intel® Xeon® Processor E5-2699 v4 provides 40 PCIe v3.0 lanes, and an Intel X520-QDA1 10GbE card needs 4 or 8 PCIe v3.0 lanes to connect to the CPU.
A NIC can't be connected to two CPUs at the same time, since a PCIe link goes directly into one CPU. Which physical PCIe slot connects to which CPU socket depends on the motherboard, and it can't easily be switched since it is hardwired. The PCIe topology information should be in the motherboard's datasheet, or printed on the board next to the PCIe slot (e.g. CPU1_PCIE8, CPU2_PCIE4).
https://www.asus.com/us/Commercial-Servers-Workstations/ESC4000_G3S/specifications/
http://www.intel.com/content/www/us/en/embedded/products/grantley/specifications.html
Accessing a NIC in the same NUMA domain is faster than across NUMA domains. Some performance numbers for reference can be found at http://docplayer.net/5271505-Network-function-virtualization-virtualized-bras-with-linux-and-intel-architecture.html (Figures 12-16).
In summary, whenever possible, use cores in the same NUMA node as the NIC to get the best performance.
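One simple way to find out which NUMA node a NIC hangs off on Linux is to read it from sysfs. A minimal sketch (the PCI address passed in, e.g. "0000:03:00.0", is a made-up placeholder; see lspci for yours):

#include <stdio.h>

/* Returns the NUMA node of the given PCI device, or -1 if unknown/not exposed
 * (e.g. on single-socket machines). */
static int pci_numa_node(const char *bdf)
{
    char path[128];
    int node = -1;
    FILE *f;

    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/numa_node", bdf);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);
    }
    return node;   /* e.g. pci_numa_node("0000:03:00.0") */
}

You can then pin your packet-processing threads to cores on that node.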

Approximate latency to access caches and main memory via QPI (dual socket/processor)

This thread has a good list of times that it takes to access various parts of the computer architecture in a uniprocessor environment. How about in a dual processor environment, over Intel's QPI bus?
Let's assume memory for a 64-byte packet is allocated on the first CPU. The second CPU has to access it via an 8.0 GT/s QPI bus, so I know the serialization latency alone is ~4 ns. What additional latency should I expect on the QPI bus?

Disparity between bus throughput and CPU throughput and their effect on sequential and parallel computing

What is the disparity between bus throughput and CPU throughput? How does this adversely impact sequential computing? How does this adversely impact parallel computing?
If your CPU can access its cache in 1 ns but your memory takes 60 ns to deliver a random memory word, at some point your processor is going to read memory at a rate 60x slower than the cache. If you are processing a lot of data, you may see a tremendous slowdown, even for sequential programs.
If you have multiple CPUs, they will collectively place a higher bandwidth demand on the bus. Imagine a serial-access bus with 64 CPUs all trying to read from it: only one succeeds at any moment. The consequence is that it is hard to get a parallel speedup of 64 in such a system unless each processor stays entirely within its cache.
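To make the 1 ns vs 60 ns point above concrete, here is a small micro-benchmark sketch comparing a random pointer chase (latency-bound, DRAM-dominated) with a sequential scan of the same array; the array size and timing method are arbitrary choices for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)   /* 16M entries (~128 MB), far larger than any L3 */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next)
        return 1;

    /* Sattolo's algorithm: build a single random cycle so every load depends
     * on the previous one and hardware prefetchers cannot help. */
    for (size_t i = 0; i < N; i++)
        next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    double t0 = seconds();
    size_t idx = 0, sum = 0;
    for (size_t i = 0; i < N; i++) { idx = next[idx]; sum += idx; }
    double chase = seconds() - t0;

    t0 = seconds();
    size_t seq_sum = 0;
    for (size_t i = 0; i < N; i++)
        seq_sum += next[i];
    double scan = seconds() - t0;

    printf("random chase: %.1f ns/access, sequential scan: %.2f ns/access "
           "(checksums %zu %zu)\n", chase / N * 1e9, scan / N * 1e9, sum, seq_sum);
    free(next);
    return 0;
}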

The same driver for multiple network cards -- performance bottleneck?

I'm using the e1000e driver for multiple Intel network cards (Intel EXPI9402PT, based on the 82571EB chip). The problem is that when I try to utilize the maximum speed (1 Gb/s) on more than one interface, the speed on each interface starts to drop.
I have my own driver in kernel space designed just to send given packets. It simply allocates packets with:
skb = dev_alloc_skb(packet->len);
and then sends them with:
result = dev->hard_start_xmit(skb,dev);
Each interface has its own instance of the driver.
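For context, a hedged sketch of what that send path might look like in full with the old (pre-net_device_ops) kernel API; the 'packet' structure with 'data' and 'len' fields is the asker's and is assumed here, and any header/protocol setup is omitted:

struct sk_buff *skb = dev_alloc_skb(packet->len);
if (!skb)
    return -ENOMEM;

skb->dev = dev;
/* skb_put() extends the data area; copy the raw frame into it. */
memcpy(skb_put(skb, packet->len), packet->data, packet->len);

/* Old-style transmit hook; NETDEV_TX_OK (0) means the NIC accepted the frame. */
result = dev->hard_start_xmit(skb, dev);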
For one interface I get: 120435948 bytes/sec.
For two interfaces I get: 61080233 bytes/sec and 60515294 bytes/sec.
For three interfaces I get: 28564020 bytes/sec, 27111184 bytes/sec, 27118907 bytes/sec.
What can be the cause? Is the hard_start_xmit function reentrant?
This is most likely due to a lack of bandwidth over your motherboard.
If you're trying to pump 3 Gb/s of information through a bus slower than 3 Gb/s, you'll have problems. What sort of bus are these cards on?
There may be a fix, but I think this is a physical limitation of the board, not necessarily your driver.
When I add the numbers together for 2 interfaces, the result is slightly higher than the output for a single interface. To me, this means the system is slightly more efficient when using both interfaces, perhaps due to better CPU or bus utilization. But note that the result is only slightly better, which probably indicates that the resource causing the bottleneck is limited to about 121 MB/s. Once the load (3 active interfaces) exceeds this limit, performance drops dramatically to about 82 MB/s.
It is hard to pin down the exact cause without some additional measurements, but my guesses would be:
CPU limited: adding more CPUs to the system would rule this out as a problem.
Memory limited: remember that even if the device is in an x4 or x8 slot, the connection to main memory (i.e. where the SKBs live) may not be able to sustain that load.
Interrupt limited: the packet rate might be high enough that switching in and out of interrupt context hurts performance. This is less likely, as most drivers are good about interrupt coalescing, but if possible, switch the driver to a polled mode to rule this out (see the sketch below for checking the current coalescing settings).
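To check the interrupt theory, one quick look (my suggestion, not part of the original answer) is to read the driver's current coalescing settings via the ethtool ioctl; "eth0" is a placeholder interface name:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return 1;
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder name */
    ifr.ifr_data = (char *)&ec;

    /* Query receive coalescing: how long/how many frames the NIC batches
     * before raising an interrupt. */
    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("rx-usecs=%u rx-frames=%u\n",
               ec.rx_coalesce_usecs, ec.rx_max_coalesced_frames);
    else
        perror("SIOCETHTOOL");
    close(fd);
    return 0;
}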
