Why did PCI Express suffer high latency in pipeline transfer mode?

Why did PCI Express suffer high latency in pipeline transfer mode? - fpga

I used the Arria V GX FPGA Starter Kit connected to a computer via PCI Express (PCIE). In the Kit, I implemented my Direct Memory Access (DMA) Read/Write using the pipeline transfer. The DMA read the data from the PC's memory then write to another region of the PC's memory via the PCIE.
The IP I used is Avalon-MM Arria V Hard IP for PCIE with the configuration: Gen1 x8, 32-bit Avalon-MM address width. The software on the PC is Visual Studio programming by C++ and using the 12.0.0 Jungo Windriver.
The project works fine but the transferring speed, especially the reading speed, is too slow. I had done a lot of projects with this DMA, so I don't think the problem is because of my DMA. I have checked the Altera SignalTap of the project, and find out that:
+ (Figure 1) There are always over 100 clocks since the DMA began to read (the first time 'read' signal is asserted) to the first returning data (the first time 'readdatavalid' signal is asserted)
+ (Figure 2) After that, there are always about 20 to 50 standby clocks between two returning data, which is too slow.
My design needs to read the data from PC: (1) very little data (about 5 to 10 data for each time); (2) random access (that's why I didn't use burst transfer). But every time a new transferring session started, over 100 clocks are wasted at the beginning and I don't know why. To conclude, Avalon memory-mapped read pipeline costs about 200 clocks just to read 5 data from the PC's memory via PCIE.
My questions are: (1) Why there are so many clocks being wasted in the read pipeline transfer via PCIE? (2) Is there anything else I can do to speed up the transfer rate?
Thank you very much

PCIE was really designed to maximize I/O bandwidth but not to minimize latency. In my designs, I have often seen such a long latency for the first read response from system memory and I do not have any suggestions on how to reduce it.
To get performance from PCIE you have to design around the latency, using burst transfers and pipelining requests so that you can keep many requests in flight in order to maximize bandwidth.

Related

Will memory that is physically closer to the CPU perform faster than memory physically farther away?

I know this may sound like a silly question considering the speeds at which computers work, but say a certain address in RAM is physically closer to the CPU on the motherboard, compared to a memory address that is located the farthest possible to the CPU, will this have an affect on the speed that the closer memory address is accessed compared to the farthest memory address?

If you're talking about NUMA accessing RAM connected to this socket vs. going over the interconnect to access RAM connected to another socket, then yes this is a well known effect. example. Otherwise, no.
Also note that signal travel time over the external memory bus is only tiny fraction of the total latency cache-miss latency cost for a CPU core. Queuing inside the CPU, time to check L3 cahce, and the internal bus between cores and memory controllers, all adds up. Tightening DDR4 CAS latency by 1 whole memory cycle makes only a small (but measurable) difference to overall memory performance (see hardware review sites benchmarking memory overclocking), other timings even less so.
No, DDR4 (and earlier) memory busses are synced to a clock and expect a response at a specific number of memory-clock cycles1 after a command (so the controller can pipeline requests without causing overlap). See What Every Programmer Should Know About Memory? for more about DDR memory commands and memory timings (and CAS latency vs. other timings).
(Wikipedia's introduction to SDRAM mentions that earlier DRAM standards were asynchronous, so yes they maybe could just reply as soon as they had the data ready. If that happened to be a whole clock cycle early, a speedup was perhaps possible.)
So memory latency is discrete, not continuous, and being 1 mm closer can't make it fractions of a nanosecond faster. The only plausible effect is if you socket all the memory into DIMM slots in a way that enables you to run tighter timings and/or a faster memory clock than with some other arrangement. Go read about memory overclocking if you want real-world experience with people who try to push systems to the limits of stability. What's best may depend on the motherboard; physical length of traces isn't the only consideration.
AFAIK, all real-world motherboard firmwares insist on using the same timings for all DIMMs on all memory channels2.
So even if one DIMM could theoretically support tighter timings than another, you couldn't actually configure a system to make that happen. e.g. because of shorter or less noisy traces, less signal reflection because it's at the end instead of middle of some traces, or whatever. Physical proximity isn't the only thing that could help.
(This is probably a good thing; interleaving physical address space across multiple DRAM channels allows sequential reads/writes to benefit from the aggregate bandwidth of all channels. But if they ran at different speeds, you might have more contention for shared busses between controllers and cores, and more time left unused.)
Memory frequency and timings are usually chosen by the firmware after reading the SPD ROM on each DIMM (memory module) to find out what memory is installed and what timings each DIMM is rated for at what frequencies.
Footnote 1: I'm not sure how transmission-line propagation delays over memory traces are accounted for when the memory controller and DIMM agree on how many cycles there should be after a read command before the DIMM starts putting data on the bus.
The CAS latency is a timing number that the memory controller programs into the "mode register" of each DIMM.
Presumably the number the DIMM sees is the actual number it uses, and the memory controller has to account for the round-trip propagation delay to know when to really expect a read burst to start arriving. Other command latencies are just times between sending different commands so propagation delay doesn't matter: the gap at the sending side equals the gap at the receiving side.
But the CAS latency seen by the memory controller includes the round-trip propagation delay for signals to go over the wires to the DIMM and back. Modern systems with DDR4-4000 have a clock that runs at 2GHz, cycle time of half a nanosecond (and transferring data on the rising and falling edge).
At light speed, 0.5ns is "only" about 15 cm, half of one of Grace Hopper's nanoseconds, and with transmission-line effects could be somewhat shorter (like maybe 2/3rd of that). On a big server motherboard it's certainly plausible that some DIMMs are far enough away from the CPU for traces to be that long.
The rated speeds on memory DIMMs are somewhat conservative so they're still supposed to work at that speed even when as far as allowed by DDR4 standards. I don't know the details, but I assume JEDEC considers this when developing DDR SDRAM standards.
If there's a "data valid" pin the DIMM asserts at the start of the read burst, that would solve the problem, but I haven't seen a mention of that on Wikipedia.
Timings are those numbers like 9-9-9-24, with the first one being CAS latency, CL. https://www.hardwaresecrets.com/understanding-ram-timings/ was an early google hit if you want to read more from a perf-tuning PoV. Also described in Ulrich Drepper's "What Every Programmer Should Know about Memory" linked earlier, from a how-it-works PoV. Note that the higher the memory clock speed, the less real time (in nanoseconds) a given number of cycles is. So CAS latency and other timings have stayed nearly constant in nanoseconds as clock frequencies have increase, or even dropped. https://www.crucial.com/articles/about-memory/difference-between-speed-and-latency shows a table.
Footnote 2: Unless we're talking about special faster memory for use as a scratchpad or cache for the larger main memory, but still off-chip. e.g. the 16GB of MCDRAM on Xeon Phi cards, separate from the 384 GB of regular DDR4. But faster memories are usually soldered down so timings are fixed, not socketed DIMMs. So I think it's fair to say that all DIMMs in a system will run with the same timings.
Other random notes:
https://www.overclock.net/threads/ram-4x-sr-or-2x-dr-for-ryzen-3000.1729606/ contained some discussion of motherboards with a "T-topology" vs. "daisy chain" for the layout of their DIMM sockets. This seems pretty self-explanatory terminology: a T would be when each of the 2 DIMMs on a channel are on opposite sides of the CPU, about equidistant from the pins. vs. "daisy chain" when both DIMMs for the same channel are on the same side of the CPU, with one farther away than the other.
I'm not sure what the recommended practice is for using the closer or farther socket. Signal reflection could be more of a concern with the near socket because it's not the end of the trace.
If you have multiple DIMMs on the same memory channel by the "chip-enable" pin , the DDR4 protocol may require they all run at the same timings. (Such DIMMs see each others commands, except there's a "chip-select" pin that the memory controller can control independently for each DIMM to control which one the command is for.
But in theory a CPU could be designed to run its different memory channels at different frequencies, or at least different timings at the same frequency if the memory controllers all share a clock. And of course in a multi-socket system, you'd expect no physical / electrical obstacle to programming different timings for the different sockets.
(I haven't played around in the BIOS on a multi-socket system for years, not since I was a cluster sysadmin in AMD K8 / K10 days). So IDK, it's possible that some BIOS might have options to control different timings for different sockets, or simply allow different auto-detect if you use slower RAM in one socket than in others. But given the price of servers and how few people run them as hobby machines, it's unlikely that vendors would bother to support or validate such a config.

Why isn't there a data bus which is as wide as the cache line size?

When a cache miss occurs, the CPU fetches a whole cache line from main memory into the cache hierarchy. (typically 64 bytes on x86_64)
This is done via a data bus, which is only 8 byte wide on modern 64 bit systems. (since the word size is 8 byte)
EDIT: "Data bus" means the bus between the CPU die and the DRAM modules in this context. This data bus width does not necessarily correlate with the word size.
Depending on the strategy the actually requested address gets fetched at first, and then the rest of the cache line gets fetched sequentially.
It would seem much faster if there was a bus with 64 byte width , which would allow to fetch a whole cache line at once. (this would be eight times larger than the word size)
Perhaps there could be two different data bus widths, one for the standard cache line fetching and one for external hardware (DMA) that works only with word size memory access.
What are the limitations that limit the size of the data bus?

I think DRAM bus width expanded to the current 64 bits before AMD64. It's a coincidence that it matches the word size. (P5 Pentium already guaranteed atomicity of 64-bit aligned transfers, because it could do so easily with its 64-bit data bus. Of course that only applied to x87 (and later MMX) loads/stores on that 32-bit microarchitecture.)
See below: High Bandwidth Memory does use wider busses, because there's a limit to how high you can clock things, and at some point it does become advantageous to just make it massively parallel.
It would seem much faster if there was a bus with 64 byte width , which would allow to fetch a whole cache line at once.
Burst transfer size doesn't have to be correlated with bus width. Transfers to/from DRAM do happen in cache-line-sized bursts. The CPU doesn't have to send a separate command for each 64-bits, just to set up the burst transfer of a whole cache-line (read or write). If it wants less, it actually has to send an abort-burst command; there is no "single byte" or "single word" transfer command. (And yes that SDRAM wiki article still applied to DDR3/DDR4.)
Were you thinking that wider busses were necessary to reduce command overhead? They're not. (SDRAM commands are sent over separate pins from the data, so commands can be pipelined, setting up the next burst during the transfer of the current burst. Or starting earlier on opening a new row (dram page) on another bank or chip. The DDR4 wiki page has a nice chart of commands, showing how the address pins have other meanings for some commands.)
High speed parallel busses are hard to design. All the traces on the motherboard between the CPU socket and each DRAM socket must have the same propagation delay within less than 1 clock cycle. This means having them nearly the same length, and controlling inductance and capacitance to other traces because transmission-line effects are critical at frequencies high enough to be useful.
An extremely wide bus would stop you from clocking it as high, because you couldn't achieve the same tolerances. SATA and PCIe both replaced parallel busses (IDE and PCI) with high-speed serial busses. (PCIe uses multiple lanes in parallel, but each lane is its own independent link, not just part of a parallel bus).
It would just be completely impractical to use 512 data lines from the CPU socket to each channel of DRAM sockets. Typical desktop / laptop CPUs use dual-channel memory controllers (so two DIMMs can be doing different things at the same time), so this would be 1024 traces on the motherboard, and pins on the CPU socket. (This is on top of a fixed number of control lines, like RAS, CAS, and so on.)
Running an external bus at really high clock speeds does get problematic, so there's a tradeoff between width and clock speed.
For more about DRAM, see Ulrich Drepper's What Every Programmer Should Know About Memory. It gets surprisingly technical about the hardware design of DRAM modules, address lines, and mux/demuxers.
Note that RDRAM (RAMBUS) used a high speed 16-bit bus, and had higher bandwidth than PC-133 SDRAM (1600MB/s vs. 1066MB/s). (It had worse latency and ran hotter, and failed in the market for some technical and some non-technical reasons).
I guess that it helps to use a wider bus up to the width of what you can read from the physical DRAM chips in a single cycle, so you don't need as much buffering (lower latency).
Ulrich Drepper's paper (linked above) confirms this:
Based on the address lines a2
and a3 the content of one column
is then made available to the data pin of the DRAM
chip.
This happens many times in parallel on a number
of DRAM chips to produce a total number of bits corresponding
to the width of the data bus.
Inside the CPU, busses are much wider. Core2 to IvyBridge used 128-bit data paths between different levels of cache, and from execution units to L1. Haswell widened that to 256b (32B), with a 64B path between L1 and L2
High Bandwidth Memory is designed to be more tightly coupled to whatever is controlling it, and uses a 128-bit bus for each channel, with 8 channels. (for a total bandwidth of 128GB/s). HBM2 goes twice as fast, with the same width.
Instead of one 1024b bus, 8 channels of 128b is a tradeoff between having one extremely wide bus that's hard to keep in sync, vs. too much overhead from having each bit on a separate channel (like PCIe). Each bit on a separate channel is good if you need robust signals and connectors, but when you can control things better (e.g. when the memory isn't socketed), you can use wide fast busses.
Perhaps there could be two different data bus widths, one for the standard cache line fetching and one for external hardware (DMA) that works only with word size memory access.
This is already the case. DRAM controllers are integrated into the CPU, so communication from system devices like SATA controllers and network cards has to go from them to the CPU over one bus (PCIe), then to RAM (DDR3/DDR4).
The bridge from the CPU internal memory architecture to the rest of the system is called the System Agent (this basically replaces what used to be a separate Northbridge chip on the motherboard in systems without an integrated memory controller). The chipset Southbridge communicates with it over some of the PCIe lanes it provides.
On a multi-socket system, cache-coherency traffic and non-local memory access also has to happen between sockets. AMD may still use hypertransport (a 64-bit bus). Intel hardware has an extra stop on the ring bus that connects the cores inside a Xeon, and this extra connection is where data for other sockets goes in or out. IDK the width of the physical bus.

I think there is physical/cost trouble. in addition to the data lines (64) has a address lines (15+) and bank_select lines (3). Plus other lines (CS, CAS, RAS...). For example see 6th Generation Intel® Core™ Processor Family Datasheet. In general, about 90 lines for only one bus and 180 for two. There are other lines (PCIe, Dysplay...) The next aspect is burst reading. With bank_select we can select one of 8 banks. In burst mode with one writing of address at all banks we reading data from all banks by bank per tick.

How to measure memory bandwidth utilization on Windows?

I have a highly threaded program but I believe it is not able to scale well across multiple cores because it is already saturating all the memory bandwidth.
Is there any tool out there which allows to measure how much of the memory bandwidth is being used?
Edit: Please note that typical profilers show things like memory leaks and memory allocation, which I am not interested in.
I am only whether the memory bandwidth is being saturated or not.

If you have a recent Intel processor, you might try to use Intel(r) Performance Counter Monitor: http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ It can directly measure consumed memory bandwidth from the memory controllers.

I'd recommend the Visual Studio Sample Profiler which can collect sample events on specific hardware counters. For example, you can choose to sample on cache misses. Here's an article explaining how to choose the CPU counter, though there are other counters you can play with as well.

it would be hard to find a tool that measured memory bandwidth utilization for your application.
But since the issue you face is a suspected memory bandwidth problem, you could try and measure if your application is generating a lot of page faults / sec, which would definitely mean that you are no where near the theoretical memory bandwidth.
You should also measure how cache friendly your algorithms are. If they are thrashing the cache, your memory bandwidth utilization will be severely hampered. Google "measuring cache misses" on good sources that tells you how to do this.

It isn't possible to properly measure memory bus utilisation with any kind of software-only solution. (it used to be, back in the 80's or so. But then we got piplining, cache, out-of-order execution, multiple cores, non-uniform memory architectues with multiple busses, etc etc etc).
You absolutely have to have hardware monitoring the memory bus, to determine how 'busy' it is.
Fortunately, most PC platforms do have some, so you just need the drivers and other software to talk to it:
wenjianhn comments that there is a project specficially for intel hardware (which they call the Processor Counter Monitor) at https://github.com/opcm/pcm
For other architectures on Windows, I am not sure. But there is a project (for linux) which has a grab-bag of support for different architectures at https://github.com/RRZE-HPC/likwid
In principle, a computer engineer could attach a suitable oscilloscope to almost any PC and do the monitoring 'directly', although this is likely to require both a suitably-trained computer engineer as well as quite high performance test instruments (read: both very costly).
If you try this yourself, know that you'll likely need instruments or at least analysis which is aware of the protocol of the bus you're intending to monitor for utilisation.
This can sometimes be really easy, with some busses - eg old parallel FIFO hardware, which usually has a separate wire for 'fifo full' and another for 'fifo empty'.
Such chips are used usually between a faster bus and a slower one, on a one-way link. The 'fifo full' signal, even it it normally occasionally triggers, can be monitored for excessively 'long' levels: For the example of a USB 2.0 Hi-Speed link, this happens when the OS isn't polling the USB fifo hardware on time. Measuring the frequency and duration of these 'holdups' then lets you measure bus utilisation, but only for this USB 2.0 bus.
For a PC memory bus, I guess you could also try just monitoring how much power your RAM interface is using - which perhaps may scale with use. This might be quite difficult to do, but you may 'get lucky'. You want the current of the supply which feeds VccIO for the bus. This should actually work much better for newer PC hardware than those ancient 80's systems (which always just ran at full power when on).
A fairly ordinary oscilloscope is enough for either of those examples - you just need one that can trigger only on 'pulses longer than a given width', and leave it running until it does, which is a good way to do 'soak testing' over long periods.
You monitor utiliation either way by looking for the change in 'idle' time.
But modern PC memory busses are quite a bit more complex, and also much faster.
To do it directly by tapping the bus, you'll need at least an oscilloscope (and active probes) designed explicitly for monitoring the generation of DDR bus your PC has, along with the software analysis option (usually sold separately) to decode the protocol enough to figure out the kind of activity which is occuring on it, from which you can figure out what kind of activity you want to measure as 'idle'.
You may even need a motherboard designed to allow you to make those measurements also.
This isn't so staightfoward as just looking for periods of no activity - all DRAM needs regular refresh cycles at the very least, which may or may not happen along with obvious bus activity (some DRAM's do it automatically, some need a specific command to trigger it, some can continue to address and transfer data from banks not in refresh, some can't, etc).
So the instrument needs to be able to analyse the data deeply enough for you extract how busy it is.
Your best, and simplest bet is to find a PC hardware (CPU) vendor who has tools which do what you want, and buy that hardware so you can use those tools.
This might even involve running your application in a VM, so you can benefit from better tools in a different OS hosting it.
To this end, you'll likely want to try Linux KVM (yes, even for Windows - there are windows guest drivers for it), and also pin down your VM to specific CPUs, whilst you also configure linux to avoid putting other jobs on those same CPUs.

The same driver for multiple network cards -- performance bottleneck?

I'm using driver e1000e for multiple Intel network cards (Intel EXPI9402PT, based on 82571EB chip). The problem is that when I'm trying to utilize maximum speed (1GB) on more than one interface, speed on each interface starts to drop down.
I have my own driver in kernel space designed to just sent given packets. It just allocs packets by:
skb = dev_alloc_skb(packet->len);
and them sends them by:
result = dev->hard_start_xmit(skb,dev);
Each interface has its own instance of the driver.
For one interface I get: 120435948 bytes/sec.
For two interfaces I get: 61080233 bytes/sec and 60515294 bytes/sec.
For three interfaces I get: 28564020 bytes/sec, 27111184 bytes/sec, 27118907 bytes/sec.
What can be the cause? Is the hard_start_xmit function reentrant?

This is most likely due to a lack of bandwidth over your motherboard.
If you're trying to pump 3 Gb/s of information through a bus slower than 3 Gb/s, you'll have problems. What sort of bus are these cards on?
There may be a fix, but I think this is a physical limitation of the board, not necessarily your driver.

When I add the numbers together for 2 interfaces, the net result is slightly bigger than the output for a single interface. To me, this means the system is being slightly more efficient when using both interfaces. One possible reason might be better CPU utilization or possibly bus utilization. But note that the result is only slightly better and probably indicates that the resource causing the bottle neck is limited to 121MB/s. Once the load (3 active interfaces) exceeds this limit, performance drops dramatically to 82MB/s.
It is hard to pin down the exact cause without some additional measurements, but my guesses would be
CPU limited : Adding multiple CPU to the system would rule this out as a problem.
Memory limited : Remember that even if the device is in a x4 or x8 slot, the connection to main memory (i.e. where the SKB live) may not be able to sustain that load.
Interrupt limited : The packets per second might be high enough that switching in and out of interrupt contexts is hurting performance. This is less likely as most drivers are good about interrupt coalescing, but if possible, switch the driver to a polled mode to rule this out.

Easiest way to determine compilation performance hardware bottleneck on single PC?

I've now saved a bit of money for the hardware upgrade. What I'd like to know, which is the easiest way to measure which part of hardware is the bottleneck for compiling and should be upgraded?
Are there any clever techniques I could use? I've looked into perfmon, but it has too many counters and isn't very helpful without exact knowledge what should be looked at.
Conditions: Home development, Windows XP Pro, Visual Studio 2008
Thanks!

The question is really "what is maxed out during compilation?"
If you don't want to use perfmon, you can use something like the task monitor.
Run a compile.
See what's maxed out.
Did you go to 100% CPU for the whole time? Get more CPU -- faster or more cores or something.
Did you go to 100% memory for the whole time? Which number matters on the display? The only memory you can buy is "physical" memory. The only factor that matters is physical memory. The other things you see on the meter are not things you buy, they're adjustments to make to the way Windows works.
Did you go to "huge" amounts of I/O? You can't easily tell what's "huge", but you can conclude this. If you're not using memory and not using CPU, then you're using the only resource that's left -- you're I/O bound and you need a faster bus -- which usually means a whole new machine.
A faster HDD is of little or no value -- the bus clock speed is one limiting factor. The bus width is the other limiting factor. No one designs an ass-kicking I/O bus and then saddles it with junk HDD's. Usually, they design the bus that fits a specific cost target based on available HDD's.

A faster HDD is of little or no value -- the bus clock speed is one limiting factor. The bus width is the other limiting factor. No one designs an ass-kicking I/O bus and then saddles it with junk HDD's. Usually, they design the bus that fits a specific cost target based on available HDD's.
Garbage. Modern HDDs are slow compared to the I/O buses they are connected to. Name a single HDD that can max out a SATA 2 interface (and that is even a generation old now) for random IOPS... A hard drive is lucky to hit 10MB/s when the bus is capable of around 280MB/s.
E.g. http://www.anandtech.com/show/2948/3. Even there the SSDs are only hitting 50MB/s. It's clear the IOPs are NOT the bottleneck otherwise the HDD would do just as much as the SSDs.
I've never seen a computer IOPs bound rather than HDD bound. It doesn't happen.

Using the task monitor has already been suggested but the Sys Internals task monitor gives you more information than the built-in Windows task monitor:
Sys Internals task monitor
You might also want to see what other things are running on your PC which are using up memory and / or CPU processing power. It may be possible to remove or only run on demand things which are affecting performance.
Windows XP will only support 3GB of memory using a switch that you have to turn on and
I seem to remember that applications need to be written to actually take this into consideration.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio