DMA transfer size - device

Is there a specific size at which DMA transfers data to memory? For example, on the CPU side, data is generally read from or written to physical memory (DRAM) at the granularity of 64 bytes (the cache block size). My question is: when a device uses DMA to write to memory, does the controller use a similar fixed size for the actual data transfer?
Please note that I am not asking whether DMA transactions can have different transfer sizes, since a bigger packet can always be broken into fixed-size blocks.
Thanks
Arka

This is extremely platform-dependent. According to this information sheet on PCI-E:
Intel desktop chipsets support at most a 64-byte maximum payload while Intel server chipsets support at most a 128-byte maximum payload. The primary reason for this is to match the cache line size for snooping on the front side bus.
...
Chipsets produced by vendors other than Intel have supported a higher value; 512 bytes is the commonly known maximum payload value for a server North Bridge.
Assuming you are talking about PCI-E, the search terms you want are "PCI Express" and "payload size".
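If you want to check what a specific device has actually been configured for, the negotiated value lives in the Max_Payload_Size field of the PCIe Device Control register (lspci -vv shows it as MaxPayload). Below is a minimal sketch that reads it from sysfs config space; the device address 0000:01:00.0 is only a placeholder, and reading past the first 64 bytes of config space normally requires root:

/* Read Max_Payload_Size from a PCIe device's Device Control register.
 * Sketch only: the BDF below is a placeholder; reading config space
 * beyond offset 0x3f usually requires root. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "0000:01:00.0"; /* placeholder BDF */
    char path[256];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/config", dev);

    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }

    uint8_t cfg[256];
    size_t n = fread(cfg, 1, sizeof(cfg), f);
    fclose(f);
    if (n < 64) { fprintf(stderr, "short read (%zu bytes)\n", n); return 1; }

    /* Walk the capability list (pointer at offset 0x34) looking for the
     * PCI Express capability (ID 0x10). */
    uint8_t pos = cfg[0x34];
    while (pos >= 0x40 && pos + 9 < (int)n) {
        uint8_t id = cfg[pos], next = cfg[pos + 1];
        if (id == 0x10) {
            /* Device Control is at offset 8 in the PCIe capability;
             * Max_Payload_Size is bits 7:5, encoded as 128 << value. */
            uint16_t devctl = cfg[pos + 8] | (cfg[pos + 9] << 8);
            int mps = 128 << ((devctl >> 5) & 0x7);
            printf("%s: Max_Payload_Size = %d bytes\n", dev, mps);
            return 0;
        }
        pos = next;
    }
    fprintf(stderr, "no PCI Express capability found (run as root?)\n");
    return 1;
}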

Related

Can the Intel performance monitor counters be used to measure memory bandwidth?

Can the Intel PMU be used to measure per-core read/write memory bandwidth usage? Here "memory" means DRAM (i.e., accesses that do not hit in any cache level).
Yes(ish), indirectly. You can use the relationship between counters (including the time stamp) to infer other numbers. For example, if you sample a 1-second interval and there are N last-level (L3) cache misses, you can be pretty confident you are occupying N*CacheLineSize bytes per second.
It gets a bit stickier to relate it accurately to program activity, as those misses might reflect CPU prefetching, interrupt activity, etc.
There is also a morass of "this CPU doesn't count (MMX, SSE, AVX, ...) unless this config bit is in this state"; thus rolling your own is cumbersome....
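If you do decide to roll your own, the core of the idea above is just the standard perf_event_open pattern: count last-level-cache misses over an interval and multiply by the line size. A minimal sketch follows; the 64-byte line size is an assumption, the sleep() stands in for the workload under test, and the prefetch/interrupt caveats above still apply:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <asm/unistd.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* generic last-level-cache miss event */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    sleep(1);                      /* placeholder for the workload under test */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    close(fd);

    /* Assume each miss pulls one 64-byte line; over the ~1 s interval this
     * approximates the demand-miss bandwidth. */
    printf("~%.1f MB/s of demand-miss traffic\n", misses * 64 / 1e6);
    return 0;
}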
Yes, this is possible, although it is not necessarily as straightforward as programming the usual PMU counters.
One approach is to use the programmable memory controller counters, which are accessed via PCI space. A good place to start is Intel's own implementation in pcm-memory at pcm-memory.cpp. This app shows you the per-socket or per-memory-controller throughput, which is suitable for some uses. In particular, the measured bandwidth is shared among all cores, so on a quiet machine you can assume most of it belongs to the process under test; and if you want to monitor at the socket level, it's exactly what you want.
The other alternative is careful programming of the "offcore response" counters. These, as far as I know, relate to traffic between the L2 (the last core-private cache) and the rest of the system. You can filter by the result of the offcore response, so you can use a combination of the various "L3 miss" events and multiply by the cache line size to get read and write bandwidth. The events are quite fine-grained, so you can further break the traffic down by what caused the access in the first place: instruction fetch, demand data requests, prefetching, and so on.
The offcore response counters have generally lagged behind in support from tools like perf and likwid, but recent versions seem to have reasonable support, even for client parts like SKL.
The offcore response performance monitoring facility can be used to count all core-originated requests on the IDI from a particular core. The request type field can be used to count specific types of requests, such as demand data reads. However, to measure per-core memory bandwidth, the number of requests has to be somehow converted into bytes per second. Most requests are of the cache line size, i.e., 64 bytes. The size of other requests may not be known and could add to the memory bandwidth a number of bytes that is smaller or larger than the size of a cache line. These include cache line-split locked requests, WC requests, UC requests, and I/O requests (but these don't contribute to memory bandwidth), and fence requests that require all pending writes to be completed (MFENCE, SFENCE, and serializing instructions).
If you are only interested in cacheable bandwidth, then you can count the number of cacheable requests and multiply that by 64 bytes. This can be very accurate, assuming that cacheable cache line-split locked requests are rare. Unfortunately, writebacks from the L3 (or L4 if available) to memory cannot be counted by the offcore response facility on any of the current microarchitectures. The reason for this is that these writebacks are not core-originated and usually occur as a consequence of a conflict miss in the L3. So the request that missed in the L3 and caused the writeback can be counted, but the offcore response facility does not enable you to determine whether any request to the L3 (or L4) has caused a writeback or not. That's why it's impossible to count writebacks to memory "per core."
In addition, offcore response events require a programmable performance counter that is one of 0, 1, 2, or 3 (but not 4-7 when hyperthreading is disabled).
Intel Xeon Broadwell processors support a number of Resource Director Technology (RDT) features. In particular, they support Memory Bandwidth Monitoring (MBM), which is the only way to measure memory bandwidth accurately per core in general.
MBM has three advantages over offcore response:
It enables you to measure bandwidth of one or more tasks identified with a resource ID, rather than just per core.
It does not require one of the general-purpose programmable performance counters.
It can accurately measure local or total bandwidth, including writebacks to memory.
The advantage of offcore response is that it supports request type, supplier type, and snoop info fields.
Linux supports MBM starting with kernel version 4.6. On kernels 4.6 to 4.13, the MBM events are supported in perf using the following event names:
intel_cqm_llc/local_bytes - bytes sent through local socket memory controller
intel_cqm_llc/total_bytes - total L3 external bytes sent
The events can also be accessed programmatically.
Starting with 4.14, the implementation of RDT in Linux has significantly changed.
On my BDW-E5 (dual socket) system running kernel version 4.16, I can see the byte counts of MBM using the following sequence of commands:
// Mount the resctrl filesystem.
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
// Print the number of local bytes on the first socket.
cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes
// Print the number of total bytes on the first socket.
cat /sys/fs/resctrl/mon_data/mon_L3_00/mbm_total_bytes
// Print the number of local bytes on the second socket.
cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_local_bytes
// Print the number of total bytes on the second socket.
cat /sys/fs/resctrl/mon_data/mon_L3_01/mbm_total_bytes
My understanding is that the number of bytes is counted since system reset.
Note that by default, the resource being monitored is the whole socket.
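Because the counts are cumulative, turning them into a bandwidth figure just means sampling a counter twice and dividing the difference by the interval. A minimal sketch, assuming the resctrl filesystem is mounted as above and using the socket-0 local-bytes file shown in the commands:

#include <stdio.h>
#include <unistd.h>

/* Sample mbm_local_bytes twice and report the local memory bandwidth of
 * the default (whole-socket) monitoring group on socket 0. */
static unsigned long long read_counter(const char *path)
{
    unsigned long long v = 0;
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%llu", &v); fclose(f); }
    return v;
}

int main(void)
{
    const char *path = "/sys/fs/resctrl/mon_data/mon_L3_00/mbm_local_bytes";
    unsigned long long before = read_counter(path);
    sleep(1);                                     /* measurement interval */
    unsigned long long after = read_counter(path);
    printf("local bandwidth, socket 0: %.1f MB/s\n", (after - before) / 1e6);
    return 0;
}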
Unfortunately, most RDT features, including MBM, turned out to be buggy on the Skylake processors that support them. According to errata SKZ4 and SKX4:
Intel® Resource Director Technology (RDT) Memory Bandwidth Monitoring (MBM) does not count cacheable write-back traffic to local memory. This results in the RDT MBM feature under counting total bandwidth consumed.
That is why MBM is disabled by default on Linux when running on Skylake-X and Skylake-SP (which are the only Skylake processors that support MBM). You can enable it by adding the parameter rdt=mbmtotal,mbmlocal to the kernel command line. There is no flag in a hardware register to enable or disable MBM or any other RDT feature; instead, this is tracked in a data structure in the kernel.
On the Intel Core 2 microarchitecture, memory bandwidth per core can be measured using the BUS_TRANS_MEM event as discussed here.
On some architectures, with perf you can access the uncore-PMU counters of the memory controller.
$ perf list
[...]
uncore_imc_0/cas_count_read/ [Kernel PMU event]
uncore_imc_0/cas_count_write/ [Kernel PMU event]
uncore_imc_0/clockticks/ [Kernel PMU event]
[...]
Then:
$ perf stat -e "uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/" <program> <arguments>
will report the number of bytes transferred between main memory and the caches by read and write operations through memory controller #0. Divide that number by the elapsed time and you have an approximation of the average memory bandwidth used.
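The same counts can also be read programmatically through perf_event_open by using the uncore_imc_0 PMU's dynamic type from sysfs. The sketch below makes two assumptions not stated above: that this PMU uses the usual event/umask layout (config bits 0-7 and 8-15, as described by its sysfs format files), and that each CAS read moves one 64-byte cache line. Uncore events are system-wide, so this needs root (or relaxed perf_event_paranoid) and counts traffic from all cores:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <asm/unistd.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    unsigned type = 0, event = 0, umask = 0;
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
    if (!f || fscanf(f, "%u", &type) != 1) { perror("uncore_imc_0/type"); return 1; }
    fclose(f);
    f = fopen("/sys/bus/event_source/devices/uncore_imc_0/events/cas_count_read", "r");
    if (!f || fscanf(f, "event=%x,umask=%x", &event, &umask) != 2) {
        perror("cas_count_read"); return 1;
    }
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = type;                    /* dynamic PMU type id from sysfs */
    attr.size = sizeof(attr);
    attr.config = event | (umask << 8);  /* assumed format: event 0-7, umask 8-15 */

    int fd = perf_event_open(&attr, -1, 0, -1, 0);  /* system-wide, counted on CPU 0 */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    uint64_t before = 0, after = 0;
    read(fd, &before, sizeof(before));
    sleep(1);
    read(fd, &after, sizeof(after));
    close(fd);

    /* Assume one 64-byte cache line per CAS read. */
    printf("IMC 0 read bandwidth: ~%.1f MB/s\n", (after - before) * 64 / 1e6);
    return 0;
}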
I am not sure about the Intel PMU, but I think you can use Intel VTune Amplifier (https://software.intel.com/en-us/intel-vtune-amplifier-xe). It has a lot of tools for performance monitoring (memory, CPU cache, CPU). Maybe this will work for you.

Why isn't there a data bus which is as wide as the cache line size?

When a cache miss occurs, the CPU fetches a whole cache line from main memory into the cache hierarchy. (typically 64 bytes on x86_64)
This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems (since the word size is 8 bytes).
EDIT: "Data bus" means the bus between the CPU die and the DRAM modules in this context. This data bus width does not necessarily correlate with the word size.
Depending on the strategy, the actually requested address gets fetched first, and then the rest of the cache line is fetched sequentially.
It would seem much faster if there were a bus 64 bytes wide, which would allow fetching a whole cache line at once (eight times the word size).
Perhaps there could be two different data bus widths, one for the standard cache line fetching and one for external hardware (DMA) that works only with word size memory access.
What are the limitations that constrain the width of the data bus?
I think DRAM bus width expanded to the current 64 bits before AMD64. It's a coincidence that it matches the word size. (P5 Pentium already guaranteed atomicity of 64-bit aligned transfers, because it could do so easily with its 64-bit data bus. Of course that only applied to x87 (and later MMX) loads/stores on that 32-bit microarchitecture.)
See below: High Bandwidth Memory does use wider busses, because there's a limit to how high you can clock things, and at some point it does become advantageous to just make it massively parallel.
It would seem much faster if there were a bus 64 bytes wide, which would allow fetching a whole cache line at once.
Burst transfer size doesn't have to be correlated with bus width. Transfers to/from DRAM do happen in cache-line-sized bursts. The CPU doesn't have to send a separate command for each 64 bits, just to set up the burst transfer of a whole cache line (read or write). If it wants less, it actually has to send an abort-burst command; there is no "single byte" or "single word" transfer command. (And yes, that SDRAM wiki article still applies to DDR3/DDR4.)
Were you thinking that wider busses were necessary to reduce command overhead? They're not. (SDRAM commands are sent over separate pins from the data, so commands can be pipelined, setting up the next burst during the transfer of the current burst. Or starting earlier on opening a new row (dram page) on another bank or chip. The DDR4 wiki page has a nice chart of commands, showing how the address pins have other meanings for some commands.)
High speed parallel busses are hard to design. All the traces on the motherboard between the CPU socket and each DRAM socket must have the same propagation delay within less than 1 clock cycle. This means having them nearly the same length, and controlling inductance and capacitance to other traces because transmission-line effects are critical at frequencies high enough to be useful.
An extremely wide bus would stop you from clocking it as high, because you couldn't achieve the same tolerances. SATA and PCIe both replaced parallel busses (IDE and PCI) with high-speed serial busses. (PCIe uses multiple lanes in parallel, but each lane is its own independent link, not just part of a parallel bus).
It would just be completely impractical to use 512 data lines from the CPU socket to each channel of DRAM sockets. Typical desktop / laptop CPUs use dual-channel memory controllers (so two DIMMs can be doing different things at the same time), so this would be 1024 traces on the motherboard, and pins on the CPU socket. (This is on top of a fixed number of control lines, like RAS, CAS, and so on.)
Running an external bus at really high clock speeds does get problematic, so there's a tradeoff between width and clock speed.
For more about DRAM, see Ulrich Drepper's What Every Programmer Should Know About Memory. It gets surprisingly technical about the hardware design of DRAM modules, address lines, and mux/demuxers.
Note that RDRAM (RAMBUS) used a high speed 16-bit bus, and had higher bandwidth than PC-133 SDRAM (1600MB/s vs. 1066MB/s). (It had worse latency and ran hotter, and failed in the market for some technical and some non-technical reasons).
I guess that it helps to use a wider bus up to the width of what you can read from the physical DRAM chips in a single cycle, so you don't need as much buffering (lower latency).
Ulrich Drepper's paper (linked above) confirms this:
Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.
Inside the CPU, busses are much wider. Core2 to IvyBridge used 128-bit data paths between different levels of cache, and from execution units to L1. Haswell widened that to 256b (32B), with a 64B path between L1 and L2.
High Bandwidth Memory is designed to be more tightly coupled to whatever is controlling it, and uses a 128-bit bus for each channel, with 8 channels. (for a total bandwidth of 128GB/s). HBM2 goes twice as fast, with the same width.
Instead of one 1024b bus, 8 channels of 128b is a tradeoff between having one extremely wide bus that's hard to keep in sync, vs. too much overhead from having each bit on a separate channel (like PCIe). Each bit on a separate channel is good if you need robust signals and connectors, but when you can control things better (e.g. when the memory isn't socketed), you can use wide fast busses.
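All the peak numbers quoted in this answer come from the same relation: peak bandwidth = bus width x transfer rate. A quick check of the figures above; note that the 800 MT/s RDRAM rate and the 1 GT/s-per-pin HBM rate are filled in from general knowledge, not from the answer itself:

#include <stdio.h>

int main(void)
{
    struct { const char *name; double width_bytes, transfers_per_sec; } bus[] = {
        { "PC-133 SDRAM (64-bit, 133.33 MT/s)", 8,   133.33e6 },  /* ~1066 MB/s */
        { "PC800 RDRAM (16-bit, 800 MT/s)",     2,   800e6    },  /* 1600 MB/s  */
        { "HBM (8 x 128-bit, 1 GT/s per pin)",  128, 1e9      },  /* 128 GB/s   */
    };
    for (int i = 0; i < 3; i++)
        printf("%-38s %10.1f MB/s\n", bus[i].name,
               bus[i].width_bytes * bus[i].transfers_per_sec / 1e6);
    return 0;
}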
Perhaps there could be two different data bus widths, one for the standard cache line fetching and one for external hardware (DMA) that works only with word size memory access.
This is already the case. DRAM controllers are integrated into the CPU, so communication from system devices like SATA controllers and network cards has to go from them to the CPU over one bus (PCIe), then to RAM (DDR3/DDR4).
The bridge from the CPU internal memory architecture to the rest of the system is called the System Agent (this basically replaces what used to be a separate Northbridge chip on the motherboard in systems without an integrated memory controller). The chipset Southbridge communicates with it over some of the PCIe lanes it provides.
On a multi-socket system, cache-coherency traffic and non-local memory access also has to happen between sockets. AMD may still use hypertransport (a 64-bit bus). Intel hardware has an extra stop on the ring bus that connects the cores inside a Xeon, and this extra connection is where data for other sockets goes in or out. IDK the width of the physical bus.
I think the problem is physical/cost. In addition to the 64 data lines, a memory channel needs address lines (15+) and bank-select lines (3), plus other control lines (CS, CAS, RAS, ...). For example, see the 6th Generation Intel® Core™ Processor Family Datasheet. In total that is about 90 lines for just one channel and 180 for two, and there are other lines to route as well (PCIe, display, ...). The other aspect is burst reading: with the bank-select lines we can select one of 8 banks, and in burst mode the address is written once for all banks and then data is read from the banks, one bank per tick.

Can the Rx/Tx Packet Buffer size be changed dynamically on a Linux NIC driver?

At the moment, the transmit and receive packet size is defined by a macro
#define PKT_BUF_SZ (VLAN_ETH_FRAME_LEN + NET_IP_ALIGN + 4)
So PKT_BUF_SZ comes to around 1524 bytes, and the NIC I have can handle incoming packets from the network that are <= 1524 bytes. Anything bigger than that causes the system to crash or, worse, reboot. I am using Linux kernel 2.6.32 on RHEL 6.0, with a custom FPGA NIC.
Is there a way to change PKT_BUF_SZ dynamically by getting the size of the incoming packet from the NIC? Will it add overhead? Should the hardware drop such packets before they reach the driver?
Any help/suggestion will be much appreciated.
This isn't something that can be answered without knowledge of the specific controller. They all work differently in details.
Some Broadcom NICs, for example, have different-sized pools of buffers from which the controller will select an appropriate one based on the frame size: for example, a pool of small (256-byte) buffers, a pool of standard-size (1536 bytes or so) buffers, and a pool of jumbo buffers.
Some Intel NICs have allowed a list of fixed-size buffers together with a maximum frame size; the controller then pulls as many consecutive buffers as needed (not sure Linux ever supported this mode, though -- it's much more complicated for software to handle).
But the most common model, which most NICs use (and in fact, I believe all of the commercial ones can be used this way), is that the controller expects an entire frame to fit in a single buffer, so your buffer size needs to accommodate the largest frame you will receive.
Given that your NIC is a custom FPGA one, only its designers can advise you on the specifics you're asking about. If Linux is crashing when larger packets come through, then most likely either your allocated buffer size is not as large as you are telling the NIC it is (leading to overflow), or the NIC has a bug that is causing it to write into some other memory area.
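As a concrete illustration of that last point: the 1524 in the question is VLAN_ETH_FRAME_LEN (1518) + NET_IP_ALIGN (2) + 4 (likely the FCS). A hypothetical sketch of how a driver could derive the RX buffer length from the interface MTU instead of a fixed macro; the my_nic_* names are placeholders, not from the driver in question:

#include <linux/errno.h>
#include <linux/if_ether.h>
#include <linux/if_vlan.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical helper: size each RX buffer from the current MTU.
 * With mtu == 1500 this is 18 (VLAN_ETH_HLEN) + 1500 + 4 (ETH_FCS_LEN)
 * + 2 (NET_IP_ALIGN) = 1524, matching PKT_BUF_SZ above. */
static unsigned int my_nic_rx_buf_len(const struct net_device *dev)
{
    return VLAN_ETH_HLEN + dev->mtu + ETH_FCS_LEN + NET_IP_ALIGN;
}

/* A corresponding ndo_change_mtu hook; the simplest safe policy is to
 * refuse a change while the interface is up, and let the open path size
 * its RX buffers with my_nic_rx_buf_len(). */
static int my_nic_change_mtu(struct net_device *dev, int new_mtu)
{
    if (new_mtu < 68 || new_mtu > 9000)   /* 68 = min IPv4 MTU, 9000 = assumed jumbo limit */
        return -EINVAL;
    if (netif_running(dev))
        return -EBUSY;
    dev->mtu = new_mtu;
    return 0;
}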

Accessing Large buffer from Device through DMA

I want to know how the device, CPU, and OS work together when we want to transfer a large amount of data, say 10 GB (more than the available RAM), from a DMA-capable device. After some browsing on the internet, I came across the following two approaches:
Using an IOMMU (which translates device addresses to physical addresses)
Copying buffers to and from the peripheral's addressable memory space.
I read in some relevant Stack Overflow posts that the DMA buffer size can be increased at boot time, but I want to know how to handle a very large buffer that cannot fit in memory. Are these approaches proper?
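For the second approach, the usual pattern is to stream the transfer through a fixed-size DMA buffer that does fit in memory, reusing it chunk by chunk. A rough kernel-side sketch; the my_dev_* calls are placeholders for whatever mechanism the real device provides to start one chunk and wait for its completion:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/kernel.h>

#define CHUNK_SIZE (4UL << 20)   /* 4 MiB staging buffer, reused for every chunk */

/* Placeholders for the real device's "start one DMA chunk" and
 * "wait until it completes" operations. */
extern void my_dev_start_dma(struct device *dev, dma_addr_t addr, size_t len);
extern void my_dev_wait_dma(struct device *dev);

/* Stream total_len bytes from the device through a small coherent buffer,
 * handing each filled chunk to consume() (which might write it to a file,
 * copy it to user space, etc.). */
static int stream_from_device(struct device *dev, u64 total_len,
                              int (*consume)(void *buf, size_t len))
{
    dma_addr_t dma;
    void *buf = dma_alloc_coherent(dev, CHUNK_SIZE, &dma, GFP_KERNEL);
    u64 done = 0;
    int ret = 0;

    if (!buf)
        return -ENOMEM;

    while (done < total_len && !ret) {
        size_t len = min_t(u64, CHUNK_SIZE, total_len - done);

        my_dev_start_dma(dev, dma, len);  /* device writes into the staging buffer */
        my_dev_wait_dma(dev);             /* e.g. sleep on an interrupt/completion */
        ret = consume(buf, len);          /* hand the chunk onward */
        done += len;
    }

    dma_free_coherent(dev, CHUNK_SIZE, buf, dma);
    return ret;
}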

The bio structure in the Linux kernel

I am reading Linux Kernel Development by Robert Love. I don't understand this paragraph about the bio structure:
The basic container for block I/O within the kernel is the bio structure, which is defined in <linux/bio.h>. This structure represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. By allowing the buffers to be described in chunks, the bio structure provides the capability for the kernel to perform block I/O operations of even a single buffer from multiple locations in memory. Vector I/O such as this is called scatter-gather I/O.
What exactly does "in flight (active)" mean?
"As a list of segments" -- are we talking about this segmentation?
What does "By allowing the buffers ... in memory" mean?
Block devices are devices that transfer data in chunks (512, 1024 bytes) during an I/O transaction. struct bio is the structure used for block I/O operations from kernel space, and it is commonly used in block device driver development.
Q1) What exactly does "in flight (active)" mean?
Block devices are usually formatted with a file system for storing files. Whenever a user-space application initiates a file I/O operation (read, write), the kernel in turn initiates a sequence of block I/O operations through the file-system layer. struct bio keeps track of the block I/O transactions (initiated by the user application) that are still being processed. That is what is meant here by in flight (active).
"Q2) As a list of segments" -- are we talking about this segmentation?
Memory buffers are required by the kernel to hold data to/from Block device.
In kernel there are two possiblilites in which the memory is allocated.
Virtual Address Continuous - Physical Address Continuous (Using kmalloc() - Provides good Performance but limited in size)
Virtual Address Continuous - Physical Address Non-continuous (Using vmalloc() - For huge memory size requirement)
Here a segment indicates the first type i.e. continuous physical memory which is used for block IO transfer. List of segment indicates a set of such continuous physical memory regions. Note that the list elements are non-continuous memory segments.
Q3) What does "By allowing the buffers ... in memory" mean?
Scatter-gather is a feature that allows data to be transferred to/from multiple non-contiguous memory locations to/from the device in a single transaction (read/write). Here, struct bio keeps a record of the multiple segments to be processed: each segment is a contiguous memory region, whereas different segments need not be contiguous with one another. struct bio is what gives the kernel the ability to perform such scatter-gather I/O.
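For a feel of how a driver or filesystem builds such a scatter-gather request, here is a rough sketch using the 2.6-era interface the book describes (bio_alloc(), bio_add_page(), submit_bio()); field names and signatures have changed in newer kernels, and a real caller would also set bio->bi_end_io to learn when the operation completes:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Build one scatter-gather read: each (possibly non-contiguous) page
 * becomes one segment (bio_vec) of a single in-flight bio. */
static int read_pages(struct block_device *bdev, sector_t sector,
                      struct page **pages, int npages)
{
    struct bio *bio = bio_alloc(GFP_KERNEL, npages);
    int i;

    if (!bio)
        return -ENOMEM;

    bio->bi_bdev = bdev;       /* which block device to read from */
    bio->bi_sector = sector;   /* starting sector on that device  */
    /* A real caller would also set bio->bi_end_io here so it is notified
     * on completion (and can then bio_put() the bio). */

    for (i = 0; i < npages; i++)
        bio_add_page(bio, pages[i], PAGE_SIZE, 0);  /* one segment per page */

    submit_bio(READ, bio);     /* hand it to the block layer; completes asynchronously */
    return 0;
}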
"In flight" means an operation that has been requested, but hasn't been initiated yet.
"Segment" here means a range of memory to be read or written, a contiguous
piece of the data to be transferred as part of the operation.
"Scatter/gather I/O" is meant by scatter operations that take a contiguous range of data on disk and distributes pieces of it into memory, gather takes separate ranges of data in memory and writes them contiguously to disk. (Replace "disk" by some suitable device in the preceding.) Some I/O machinery is able to do this in one operation (and this is getting more common).
1) "In flight" means "in progress"
2) No
3) Not quite sure :)
