The same driver for multiple network cards -- performance bottleneck? - linux-kernel

I'm using the e1000e driver for multiple Intel network cards (Intel EXPI9402PT, based on the 82571EB chip). The problem is that when I try to utilize the maximum speed (1 Gbit/s) on more than one interface, the speed on each interface starts to drop.
I have my own driver in kernel space designed simply to send given packets. It allocates each packet with:
skb = dev_alloc_skb(packet->len);
and then sends them with:
result = dev->hard_start_xmit(skb, dev);
Each interface has its own instance of the driver.
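For reference, here is a minimal sketch of that kind of packet-blasting path, but handing the skb to the stock dev_queue_xmit() entry point instead of calling the driver's hard_start_xmit() directly; dev_queue_xmit() selects the TX queue and takes the TX lock for you, which matters for the reentrancy question below. The function name and payload are hypothetical and error handling is trimmed; this is not the asker's actual driver.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>
#include <linux/string.h>

/* Hedged sketch: queue one pre-built Ethernet frame on 'dev' through the
 * core TX path instead of calling the driver's transmit routine directly. */
static int blast_one(struct net_device *dev, const void *frame, size_t len)
{
    struct sk_buff *skb;

    skb = dev_alloc_skb(len);
    if (!skb)
        return -ENOMEM;

    memcpy(skb_put(skb, len), frame, len);  /* copy the frame into the skb  */
    skb->dev = dev;                         /* transmit via this interface  */
    skb->protocol = htons(ETH_P_IP);        /* assumption: an IPv4 payload  */

    return dev_queue_xmit(skb);             /* hand off to the qdisc/driver */
}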
For one interface I get: 120435948 bytes/sec.
For two interfaces I get: 61080233 bytes/sec and 60515294 bytes/sec.
For three interfaces I get: 28564020 bytes/sec, 27111184 bytes/sec, 27118907 bytes/sec.
What can be the cause? Is the hard_start_xmit function reentrant?

This is most likely due to a lack of bandwidth over your motherboard.
If you're trying to pump 3 Gb/s of information through a bus slower than 3 Gb/s, you'll have problems. What sort of bus are these cards on?
There may be a fix, but I think this is a physical limitation of the board, not necessarily your driver.

When I add the numbers together for 2 interfaces, the net result is slightly bigger than the output for a single interface. To me, this means the system is slightly more efficient when using both interfaces. One possible reason might be better CPU utilization or possibly bus utilization. But note that the result is only slightly better and probably indicates that the resource causing the bottleneck is limited to about 121 MB/s. Once the load (3 active interfaces) exceeds this limit, total performance drops dramatically to about 82 MB/s.
It is hard to pin down the exact cause without some additional measurements, but my guesses would be:
CPU limited: adding more CPUs to the system would rule this out as a problem.
Memory limited: remember that even if the device is in a x4 or x8 slot, the connection to main memory (i.e. where the SKBs live) may not be able to sustain that load.
Interrupt limited: the packets per second might be high enough that switching in and out of interrupt context is hurting performance. This is less likely, as most drivers are good about interrupt coalescing, but if possible, switch the driver to a polled mode (NAPI, sketched below) to rule this out.
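To make the last point concrete, this is roughly the shape of a polled (NAPI) path in a Linux driver: the interrupt handler only masks the device interrupt and schedules a poll function that drains the ring. The mydrv_* names are hypothetical and the registration details (netif_napi_add() etc.) vary between kernel versions; treat this as a sketch, not the e1000e code.

#include <linux/netdevice.h>
#include <linux/interrupt.h>

struct mydrv_priv {
    struct napi_struct napi;
    /* device registers, ring pointers, ... */
};

static int mydrv_poll(struct napi_struct *napi, int budget)
{
    int done = 0;

    /* drain up to 'budget' descriptors from the RX/TX rings here */

    if (done < budget) {
        napi_complete_done(napi, done);  /* work finished: leave polled mode ... */
        /* ... and re-enable the device interrupt here */
    }
    return done;
}

static irqreturn_t mydrv_irq(int irq, void *data)
{
    struct mydrv_priv *priv = data;

    /* mask further device interrupts here, then defer the work to the poll loop */
    napi_schedule(&priv->napi);
    return IRQ_HANDLED;
}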

Related

Will memory that is physically closer to the CPU perform faster than memory physically farther away?

I know this may sound like a silly question considering the speeds at which computers work, but say a certain address in RAM is physically closer to the CPU on the motherboard than another address located as far from the CPU as possible: will the closer memory address be accessed faster than the farther one?
If you're talking about NUMA (accessing RAM connected to this socket vs. going over the interconnect to access RAM connected to another socket), then yes, this is a well-known effect (example). Otherwise, no.
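A minimal sketch (mine, not from the answer) of how one might observe that NUMA effect with libnuma: pin a thread to node 0, then time strided reads over node-0 vs. node-1 memory. It assumes a two-node machine and omits most error handling; build with something like gcc -O2 numa_demo.c -lnuma.

#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SZ (256UL * 1024 * 1024)          /* 256 MiB working set */

static double scan(const volatile uint8_t *buf)
{
    struct timespec t0, t1;
    uint64_t sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < SZ; i += 64)   /* stride one cache line */
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA machine with at least two nodes\n");
        return 1;
    }

    numa_run_on_node(0);                            /* run on node 0          */
    uint8_t *local  = numa_alloc_onnode(SZ, 0);     /* memory on node 0       */
    uint8_t *remote = numa_alloc_onnode(SZ, 1);     /* memory across the link */
    if (!local || !remote)
        return 1;
    memset(local, 1, SZ);                           /* fault the pages in ... */
    memset(remote, 1, SZ);                          /* ... before timing      */

    printf("local : %.3f s\n", scan(local));
    printf("remote: %.3f s\n", scan(remote));

    numa_free(local, SZ);
    numa_free(remote, SZ);
    return 0;
}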
Also note that signal travel time over the external memory bus is only a tiny fraction of the total cache-miss latency cost for a CPU core. Queuing inside the CPU, time to check L3 cache, and the internal bus between cores and memory controllers all add up. Tightening DDR4 CAS latency by 1 whole memory cycle makes only a small (but measurable) difference to overall memory performance (see hardware review sites benchmarking memory overclocking); other timings matter even less.
No, DDR4 (and earlier) memory busses are synced to a clock and expect a response at a specific number of memory-clock cycles¹ after a command (so the controller can pipeline requests without causing overlap). See What Every Programmer Should Know About Memory for more about DDR memory commands and memory timings (and CAS latency vs. other timings).
(Wikipedia's introduction to SDRAM mentions that earlier DRAM standards were asynchronous, so yes they maybe could just reply as soon as they had the data ready. If that happened to be a whole clock cycle early, a speedup was perhaps possible.)
So memory latency is discrete, not continuous, and being 1 mm closer can't make it fractions of a nanosecond faster. The only plausible effect is if you socket all the memory into DIMM slots in a way that enables you to run tighter timings and/or a faster memory clock than with some other arrangement. Go read about memory overclocking if you want real-world experience with people who try to push systems to the limits of stability. What's best may depend on the motherboard; physical length of traces isn't the only consideration.
AFAIK, all real-world motherboard firmwares insist on using the same timings for all DIMMs on all memory channels².
So even if one DIMM could theoretically support tighter timings than another (e.g. because of shorter or less noisy traces, or less signal reflection because it sits at the end of a trace instead of the middle), you couldn't actually configure a system to take advantage of that. Physical proximity isn't the only thing that could help.
(This is probably a good thing; interleaving physical address space across multiple DRAM channels allows sequential reads/writes to benefit from the aggregate bandwidth of all channels. But if they ran at different speeds, you might have more contention for shared busses between controllers and cores, and more time left unused.)
Memory frequency and timings are usually chosen by the firmware after reading the SPD ROM on each DIMM (memory module) to find out what memory is installed and what timings each DIMM is rated for at what frequencies.
Footnote 1: I'm not sure how transmission-line propagation delays over memory traces are accounted for when the memory controller and DIMM agree on how many cycles there should be after a read command before the DIMM starts putting data on the bus.
The CAS latency is a timing number that the memory controller programs into the "mode register" of each DIMM.
Presumably the number the DIMM sees is the actual number it uses, and the memory controller has to account for the round-trip propagation delay to know when to really expect a read burst to start arriving. Other command latencies are just times between sending different commands so propagation delay doesn't matter: the gap at the sending side equals the gap at the receiving side.
But the CAS latency seen by the memory controller includes the round-trip propagation delay for signals to go over the wires to the DIMM and back. Modern systems with DDR4-4000 have a clock that runs at 2 GHz, i.e. a cycle time of half a nanosecond (transferring data on both the rising and falling edge).
At light speed, 0.5 ns is "only" about 15 cm, half of one of Grace Hopper's nanoseconds, and with transmission-line effects the distance covered could be somewhat shorter (maybe 2/3 of that). On a big server motherboard it's certainly plausible that some DIMMs are far enough away from the CPU for traces to be that long.
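A quick arithmetic check of that distance claim (my numbers, same assumptions as above: a 0.5 ns clock period and roughly 2/3 c signal propagation on a PCB trace):

#include <stdio.h>

int main(void)
{
    const double c = 3.0e8;            /* speed of light, m/s           */
    const double cycle = 0.5e-9;       /* DDR4-4000 clock period, s     */

    printf("one cycle in vacuum : %.1f cm\n", c * cycle * 100.0);              /* ~15 cm */
    printf("at ~2/3 c on a trace: %.1f cm\n", c * cycle * 100.0 * (2.0 / 3.0)); /* ~10 cm */
    return 0;
}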
The rated speeds on memory DIMMs are somewhat conservative, so they're still supposed to work at that speed even when the DIMM is as far away as the DDR4 standards allow. I don't know the details, but I assume JEDEC considers this when developing DDR SDRAM standards.
If there's a "data valid" pin the DIMM asserts at the start of the read burst, that would solve the problem, but I haven't seen a mention of that on Wikipedia.
Timings are those numbers like 9-9-9-24, with the first one being CAS latency, CL. https://www.hardwaresecrets.com/understanding-ram-timings/ was an early google hit if you want to read more from a perf-tuning PoV. It's also described in Ulrich Drepper's "What Every Programmer Should Know About Memory", linked earlier, from a how-it-works PoV. Note that the higher the memory clock speed, the less real time (in nanoseconds) a given number of cycles is. So CAS latency and other timings have stayed nearly constant in nanoseconds as clock frequencies have increased, or have even dropped. https://www.crucial.com/articles/about-memory/difference-between-speed-and-latency shows a table.
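A small sketch of that cycles-to-nanoseconds conversion; the example parts (DDR3-1600 CL9 vs. DDR4-3200 CL16) are mine, chosen only to illustrate that the latency in nanoseconds stays roughly flat across generations.

#include <stdio.h>

/* data_rate is in MT/s; the I/O clock runs at half that for DDR memory. */
static double cas_ns(double cas_cycles, double data_rate_mts)
{
    double clock_mhz = data_rate_mts / 2.0;    /* DDR: two transfers per clock */
    return cas_cycles * 1000.0 / clock_mhz;    /* cycles * ns-per-cycle        */
}

int main(void)
{
    printf("DDR3-1600 CL9 : %.2f ns\n", cas_ns(9, 1600));   /* ~11.25 ns */
    printf("DDR4-3200 CL16: %.2f ns\n", cas_ns(16, 3200));  /* ~10.00 ns */
    return 0;
}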
Footnote 2: Unless we're talking about special faster memory for use as a scratchpad or cache for the larger main memory, but still off-chip. e.g. the 16GB of MCDRAM on Xeon Phi cards, separate from the 384 GB of regular DDR4. But faster memories are usually soldered down so timings are fixed, not socketed DIMMs. So I think it's fair to say that all DIMMs in a system will run with the same timings.
Other random notes:
https://www.overclock.net/threads/ram-4x-sr-or-2x-dr-for-ryzen-3000.1729606/ contained some discussion of motherboards with a "T-topology" vs. "daisy chain" layout for their DIMM sockets. This seems pretty self-explanatory terminology: a T would be when the 2 DIMMs on a channel sit on opposite sides of the CPU, about equidistant from its pins, vs. "daisy chain" when both DIMMs for the same channel are on the same side of the CPU, with one farther away than the other.
I'm not sure what the recommended practice is for using the closer or farther socket. Signal reflection could be more of a concern with the near socket because it's not the end of the trace.
If you have multiple DIMMs on the same memory channel, the DDR4 protocol may require that they all run at the same timings. (Such DIMMs see each other's commands; a "chip-select" pin that the memory controller drives independently for each DIMM determines which one a command is for.)
But in theory a CPU could be designed to run its different memory channels at different frequencies, or at least different timings at the same frequency if the memory controllers all share a clock. And of course in a multi-socket system, you'd expect no physical / electrical obstacle to programming different timings for the different sockets.
(I haven't played around in the BIOS on a multi-socket system for years, not since I was a cluster sysadmin in the AMD K8 / K10 days.) So I don't know; it's possible that some BIOSes have options to control different timings for different sockets, or simply allow different auto-detected settings if you use slower RAM in one socket than in the others. But given the price of servers and how few people run them as hobby machines, it's unlikely that vendors would bother to support or validate such a config.

Why did PCI Express suffer high latency in pipeline transfer mode?

I used the Arria V GX FPGA Starter Kit connected to a computer via PCI Express (PCIe). In the kit, I implemented my own Direct Memory Access (DMA) read/write using pipelined transfers. The DMA reads data from the PC's memory and then writes it to another region of the PC's memory over PCIe.
The IP I used is the Avalon-MM Arria V Hard IP for PCIe, configured as Gen1 x8 with a 32-bit Avalon-MM address width. The software on the PC is written in C++ with Visual Studio and uses Jungo WinDriver 12.0.0.
The project works, but the transfer speed, especially the read speed, is too slow. I have done a lot of projects with this DMA, so I don't think the problem is the DMA itself. I checked the project with Altera SignalTap and found that:
+ (Figure 1) There are always over 100 clock cycles from when the DMA begins to read (the first time the 'read' signal is asserted) to the first returned data (the first time the 'readdatavalid' signal is asserted)
+ (Figure 2) After that, there are always about 20 to 50 idle clock cycles between consecutive returned data beats, which is too slow.
My design needs to read data from the PC: (1) very little data (about 5 to 10 words each time); (2) at random addresses (that's why I didn't use burst transfers). But every time a new transfer session starts, over 100 clock cycles are wasted at the beginning, and I don't know why. In total, the Avalon memory-mapped read pipeline costs about 200 clock cycles just to read 5 words from the PC's memory over PCIe.
My questions are: (1) Why are so many clock cycles wasted in pipelined read transfers over PCIe? (2) Is there anything else I can do to speed up the transfer rate?
Thank you very much
PCIe was really designed to maximize I/O bandwidth, not to minimize latency. In my designs, I have often seen such a long latency for the first read response from system memory, and I don't have any suggestions on how to reduce it.
To get performance from PCIe you have to design around the latency, using burst transfers and pipelining requests so that you can keep many requests in flight and maximize bandwidth.

Why isn't there a data bus which is as wide as the cache line size?

When a cache miss occurs, the CPU fetches a whole cache line from main memory into the cache hierarchy. (typically 64 bytes on x86_64)
This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems (since the word size is 8 bytes).
EDIT: "Data bus" here means the bus between the CPU die and the DRAM modules. Its width does not necessarily correlate with the word size.
Depending on the strategy, the actually requested address is fetched first, and then the rest of the cache line is fetched sequentially.
It would seem much faster if there were a 64-byte-wide bus, which would allow a whole cache line to be fetched at once (eight times the word size).
Perhaps there could be two different data bus widths: one for standard cache-line fetching, and one for external hardware (DMA) that works only with word-sized memory access.
What are the limitations that limit the size of the data bus?
I think DRAM bus width expanded to the current 64 bits before AMD64. It's a coincidence that it matches the word size. (P5 Pentium already guaranteed atomicity of 64-bit aligned transfers, because it could do so easily with its 64-bit data bus. Of course that only applied to x87 (and later MMX) loads/stores on that 32-bit microarchitecture.)
See below: High Bandwidth Memory does use wider busses, because there's a limit to how high you can clock things, and at some point it does become advantageous to just make it massively parallel.
It would seem much faster if there were a 64-byte-wide bus, which would allow a whole cache line to be fetched at once.
Burst transfer size doesn't have to be correlated with bus width. Transfers to/from DRAM do happen in cache-line-sized bursts. The CPU doesn't have to send a separate command for each 64 bits, just to set up the burst transfer of a whole cache line (read or write). If it wants less, it actually has to send an abort-burst command; there is no "single byte" or "single word" transfer command. (And yes, that SDRAM wiki article still applies to DDR3/DDR4.)
Were you thinking that wider busses were necessary to reduce command overhead? They're not. (SDRAM commands are sent over separate pins from the data, so commands can be pipelined, setting up the next burst during the transfer of the current burst, or starting earlier on opening a new row (DRAM page) on another bank or chip. The DDR4 wiki page has a nice chart of commands, showing how the address pins have other meanings for some commands.)
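The arithmetic behind "cache-line-sized bursts", spelled out (a trivial sketch; the numbers are the standard DDR3/DDR4 channel width and default burst length):

#include <stdio.h>

int main(void)
{
    const int bus_width_bits = 64;   /* one DDR3/DDR4 channel            */
    const int burst_length   = 8;    /* BL8, the DDR3/DDR4 default burst */

    /* 64 bits/beat * 8 beats = 64 bytes = one x86 cache line per burst */
    printf("bytes per burst: %d\n", bus_width_bits / 8 * burst_length);
    return 0;
}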
High speed parallel busses are hard to design. All the traces on the motherboard between the CPU socket and each DRAM socket must have the same propagation delay within less than 1 clock cycle. This means having them nearly the same length, and controlling inductance and capacitance to other traces because transmission-line effects are critical at frequencies high enough to be useful.
An extremely wide bus would stop you from clocking it as high, because you couldn't achieve the same tolerances. SATA and PCIe both replaced parallel busses (IDE and PCI) with high-speed serial busses. (PCIe uses multiple lanes in parallel, but each lane is its own independent link, not just part of a parallel bus).
It would just be completely impractical to use 512 data lines from the CPU socket to each channel of DRAM sockets. Typical desktop / laptop CPUs use dual-channel memory controllers (so two DIMMs can be doing different things at the same time), so this would be 1024 traces on the motherboard, and pins on the CPU socket. (This is on top of a fixed number of control lines, like RAS, CAS, and so on.)
Running an external bus at really high clock speeds does get problematic, so there's a tradeoff between width and clock speed.
For more about DRAM, see Ulrich Drepper's What Every Programmer Should Know About Memory. It gets surprisingly technical about the hardware design of DRAM modules, address lines, and mux/demuxers.
Note that RDRAM (RAMBUS) used a high speed 16-bit bus, and had higher bandwidth than PC-133 SDRAM (1600MB/s vs. 1066MB/s). (It had worse latency and ran hotter, and failed in the market for some technical and some non-technical reasons).
I guess that it helps to use a wider bus up to the width of what you can read from the physical DRAM chips in a single cycle, so you don't need as much buffering (lower latency).
Ulrich Drepper's paper (linked above) confirms this:
Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.
Inside the CPU, busses are much wider. Core2 through IvyBridge used 128-bit data paths between different levels of cache and from execution units to L1; Haswell widened that to 256 bits (32 B), with a 64 B path between L1 and L2.
High Bandwidth Memory is designed to be more tightly coupled to whatever is controlling it, and uses a 128-bit bus for each channel, with 8 channels (for a total bandwidth of 128 GB/s). HBM2 goes twice as fast, with the same width.
Instead of one 1024b bus, 8 channels of 128b is a tradeoff between having one extremely wide bus that's hard to keep in sync, vs. too much overhead from having each bit on a separate channel (like PCIe). Each bit on a separate channel is good if you need robust signals and connectors, but when you can control things better (e.g. when the memory isn't socketed), you can use wide fast busses.
Perhaps there could be two different data bus widths: one for standard cache-line fetching, and one for external hardware (DMA) that works only with word-sized memory access.
This is already the case. DRAM controllers are integrated into the CPU, so communication from system devices like SATA controllers and network cards has to go from them to the CPU over one bus (PCIe), then to RAM (DDR3/DDR4).
The bridge from the CPU internal memory architecture to the rest of the system is called the System Agent (this basically replaces what used to be a separate Northbridge chip on the motherboard in systems without an integrated memory controller). The chipset Southbridge communicates with it over some of the PCIe lanes it provides.
On a multi-socket system, cache-coherency traffic and non-local memory access also has to happen between sockets. AMD may still use HyperTransport (a 64-bit bus). Intel hardware has an extra stop on the ring bus that connects the cores inside a Xeon, and this extra connection is where data for other sockets goes in or out. I don't know the width of the physical bus.
I think the problem is physical/cost. In addition to the 64 data lines, a bus has address lines (15+) and bank-select lines (3), plus other control lines (CS, CAS, RAS...). For example, see the 6th Generation Intel® Core™ Processor Family Datasheet. In total, roughly 90 lines for one bus and 180 for two, and there are other lines as well (PCIe, display...). The next aspect is burst reading: with the bank-select lines we can select one of 8 banks, and in burst mode a single address write applies to all banks, so we read data from the banks at one bank per tick.

Optimizing incoming UDP broadcast in Linux

Environment
Linux/RedHat
6 cores
Java 7/8
10G
Application
It's a low-latency, high-frequency trading application
Receives broadcasts via multicast UDP
There are multiple data streams
Each incoming packet is less than 1 KB (fixed size)
Application latency is around 4 microseconds
Architecture
A separate application thread is mapped to each incoming multicast stream
Data is received from the socket using MulticastSocket.receive() into native bytes
The bytes are parsed and the order book is prepared
Problem
In spite of a tolerable application latency of around 4 microseconds, we are not able to achieve the desired performance. We believe it is because of network latency.
Tuning steps used
Increased size of following parameters:
netdev_max_backlog
NIC ring buffer receive size
rmem_max
tcp_mem
socket receive buffer (set in the code)
Question:
We observed that the performance of the application deteriorated after we increased the values of the above parameters. Which parameters should be tuned, and what are the recommended values? Any guidance on optimizing incoming multicast traffic would be appreciated.
Is there a way to measure the network latency more accurately in an environment like this? Note that the UDP sender is an external entity (the exchange).
Thanks in advance
It is not clear what and how you measure.
You mention that you are receiving UDP, so why are you tuning the TCP buffer sizes (tcp_mem)?
Generally, increasing incoming socket buffer sizes may help with packet loss on a slow receiver, but it will not reduce latency.
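For reference, a minimal sketch (in C, even though the application is Java) of a multicast receive path with an enlarged per-socket buffer — the "socket receive buffer (set in the code)" knob above. The group address and port are hypothetical; on Linux the effective SO_RCVBUF is capped by net.core.rmem_max and reported back roughly doubled. Error handling is omitted.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int rcvbuf = 4 * 1024 * 1024;                       /* ask for 4 MiB            */
    socklen_t len = sizeof(rcvbuf);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("effective SO_RCVBUF: %d bytes\n", rcvbuf);  /* capped by rmem_max       */

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(30001),            /* hypothetical */
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    struct ip_mreq mreq = { .imr_multiaddr.s_addr = inet_addr("239.1.1.1"), /* hypothetical group */
                            .imr_interface.s_addr = htonl(INADDR_ANY) };
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    char buf[1024];                                     /* packets are < 1 KB       */
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0)
            break;
        /* parse the message and update the order book here */
    }
    close(fd);
    return 0;
}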
You may like to find out more about bufferbloat:
Bufferbloat is a phenomenon in packet-switched networks, in which excess buffering of packets causes high latency and packet delay variation (also known as jitter), as well as reducing the overall network throughput. When a router device is configured to use excessively large buffers, even very high-speed networks can become practically unusable for many interactive applications like voice calls, chat, and even web surfing.
You are also using Java for a low-latency application. People normally fail to achieve this kind of latency with Java, one of the major reasons being the garbage collector. See Quantifying the Performance of Garbage Collection vs. Explicit Memory Management for more details:
Comparing runtime, space consumption, and virtual memory footprints over a range of benchmarks, we show that the runtime performance of the best-performing garbage collector is competitive with explicit memory management when given enough memory. In particular, when garbage collection has five times as much memory as required, its runtime performance matches or slightly exceeds that of explicit memory management. However, garbage collection's performance degrades substantially when it must use smaller heaps. With three times as much memory, it runs 17% slower on average, and with twice as much memory, it runs 70% slower. Garbage collection also is more susceptible to paging when physical memory is scarce. In such conditions, all of the garbage collectors we examine here suffer order-of-magnitude performance penalties relative to explicit memory management.
People doing HFT using Java often turn off garbage collection completely and restart their systems daily.

Estimating how processor frequency affects I/O performance

I am doing research about dedicated I/O software that would run on consumer hardware. Essentially it boils down to saving huge data streams for later processing. Right now I am looking for a model to estimate performance factors on x86.
Take for example the new Macbook Pro:
high-speed Thunderbolt I/O (input/output) technology delivers an amazing 10 gigabits per second of transfer speeds in both directions
1.25 GB/s sounds nice, but most processors of the day are clocked around 2 GHz. Multiple cores make little difference as long as only one can be assigned per network channel.
So even if the software acts as a miniature operating system and limits itself to network/disk operations, the amount of data flowing to storage can't be greater than P / (2 * N)[1] chunks per second. Although this hints at the rough performance limit, I feel it's far from adequate.
What other considerations should one take estimating I/O performance in regards to processor frequency and other hardware specifics? For simplicity's sake, assume here that storage performs instantly under all circumstances.
[1] P = processor frequency; N = algorithm overhead (cycles per chunk)
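A worked illustration of that P / (2 * N) estimate with made-up numbers (a 2 GHz core, 1000 cycles of overhead per chunk, 4 KiB chunks — none of these come from the question):

#include <stdio.h>

int main(void)
{
    const double P = 2.0e9;      /* processor frequency, Hz                 */
    const double N = 1000.0;     /* cycles of overhead per chunk (assumed)  */
    const double chunk = 4096.0; /* chunk size in bytes (assumed)           */

    double chunks_per_sec = P / (2.0 * N);   /* the question's rough model  */
    printf("%.0f chunks/s -> %.2f GB/s\n",
           chunks_per_sec, chunks_per_sec * chunk / 1e9);  /* 1e6 chunks/s, ~4.1 GB/s */
    return 0;
}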
The hardware limiting factors are probably the I/O bus performance, say PCIe, and more recently, the FSB clock-rates, since memory controllers are moving from northbridge to the CPUs themselves.
Then, of course, you have to figure out what sort of processing you need to do on the input, and how much work it is to produce the output. These, at least for conventional software running on a CPU, are dependent on the processor clock, but not only. Writing your code to take advantage of the hardware facilities like caches, instruction-level parallelism, etc. is still a black art but can give you an order of magnitude performance boost.
Basically what I'm ranting about is that not all software is created equal, and you probably want to take that into account.
Most likely, hard disk controllers will determine hard disk I/O performance, graphics cards will determine maximum resolution and refresh performance, and so on. I don't really understand the question; the CPU has become less and less involved in these kinds of things (and has been for the last 10 years).
I doubt the question even has a bearing on CPUs with integrated GPUs, since the buffer to be output to the screen is in external memory, sharing a bus with (again) a controller on the motherboard.
It's all buffered, so I can only see CPUs affecting file performance if you somehow force the hardware buffer size to something insanely puny. Edit: and I'm pretty sure Apple will prevent you from doing such things. ;)
For Thunderbolt specifically, it's more about the minimum CPU model that supports the kind of bus speeds required by the Thunderbolt chipset version in the machine in question.
Thunderbolt is a raw data-traffic system and its performance specs are potential maximums, hence all the asterisks in the Apple specs. I believe it will indeed alleviate bottlenecks and in general give lag-free, intelligent data shuffling when doing many things simultaneously.
The CPU will idle-wait a shorter time for needed data, but the processing speed of the data is the same. When playing or creating a movie, codec processing time will be the same, but you will still feel a boost/lack of lag because the data is there when the CPU needs it. For the I/O, the bottleneck will become the read/write speed of your hard disk instead, and the CPU bottleneck (for file copy operations, likely at least some code in Finder) will stay the same.
In other words, only CPU-intensive tasks such as for example movie encoding will benefit significantly from a faster CPU, while the benefits of Thunderbolt vs. a mix of interfaces will boost machines with both slow and fast CPUs.
