Are there practical use cases where one PCIe Physical Function (PF) maps to hundreds or even thousands of Virtual Functions (VFs)?

It's known that a single PF can map to multiple VFs.
About the number of VFs associated with a single PF:
From the PCIe 5.0 spec:
IMPLEMENTATION NOTE
VFs Spanning Multiple Bus Numbers
As an example, consider an SR-IOV Device that supports a single PF. Initially, only PF 0 is visible. Software Sets ARI Capable Hierarchy. From the SR-IOV Extended Capability it determines: InitialVFs is 600, First VF Offset is 1 and VF Stride is 1.
If software sets NumVFs in the range [0 … 255], then the Device uses a single Bus Number.
If software sets NumVFs in the range [256 … 511], then the Device uses two Bus Numbers.
If software sets NumVFs in the range [512 … 600], then the Device uses three Bus Numbers.
PF 0 and VF 0,1 through VF 0,255 are always on the first (captured) Bus Number. VF 0,256 through VF 0,511 are always on the second Bus Number (captured Bus Number plus 1). VF 0,512 through VF 0,600 are always on the third Bus Number (captured Bus Number plus 2).
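To make the implementation note's bus-number arithmetic concrete, here is a small standalone sketch in plain C (the PF Routing ID of bus 0x10, function 0 is an assumed example). It applies the spec's formula: the Routing ID of VF n is the PF's Routing ID plus First VF Offset plus (n - 1) times VF Stride, and with ARI the upper 8 bits of the Routing ID are the bus number, so VFs past function 255 spill onto the next bus.

#include <stdint.h>
#include <stdio.h>

/* Routing ID of VF n (1-based): PF RID + First VF Offset + (n - 1) * VF Stride,
 * truncated to 16 bits. With ARI, bits [15:8] are the bus number and
 * bits [7:0] the function number. */
static uint16_t vf_rid(uint16_t pf_rid, uint16_t first_vf_offset,
                       uint16_t vf_stride, unsigned int n)
{
    return (uint16_t)(pf_rid + first_vf_offset + (n - 1) * vf_stride);
}

int main(void)
{
    uint16_t pf_rid = 0x10 << 8;                     /* assumed: bus 0x10, function 0 */
    unsigned int vfs[] = { 1, 255, 256, 511, 512, 600 };
    size_t i;

    for (i = 0; i < sizeof(vfs) / sizeof(vfs[0]); i++) {
        uint16_t rid = vf_rid(pf_rid, 1, 1, vfs[i]); /* First VF Offset 1, VF Stride 1 */
        printf("VF 0,%u -> bus 0x%02x, function 0x%02x\n",
               vfs[i], (unsigned)(rid >> 8), (unsigned)(rid & 0xff));
    }
    return 0;
}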
From Oracle:
Each SR-IOV device can have a physical function and each physical function can have up to 64,000 virtual functions associated with it.
From the "sharing PCIe I/O bandwidth" point of view, it might be understandable to having hundres or thousands of VFs (associated with a single PF), each VF is assigned to a VM, with the assumption that most of the VFs are in idle state at a particular time point;
However, from the "chip manufacturing" point of view, for a non-trival PCIe function, duplicating hundreds or thousands of the VF part of the IP instances within a single die would make the die area too large to be practical.
So my question is, as stated in the subject line, are there practical use cases for having so many VFs associcated with a single PF?

Related

Does memory-mapped I/O work by using RAM addresses?

Imagine a processor capable of addressing an 8-bit range (I know this is ridiculously small in reality) with a 128-byte RAM. There is some 8-bit device register mapped to address 100. In order to store a value to it, does the CPU need to store a value at address 100, or does it specifically need to store a value at address 100 within RAM? In pseudo-assembly:
STI 100, value
VS
STI RAM_start+100, value
Usually, the address of a device is specified relative to the start of the address space it lives in.
The datasheet surely has more context and will clarify whether the address is relative to something else.
However, before using it, you have to translate that address into the address the CPU actually sees.
For example, if your 8-bit address range accessible with the sti instruction is split in half:
0-127 => RAM
128-255 => IO
Because the hardware is wired this way, as seen from the CPU the IO address range starts at 128, so an IO address of x is accessible at 128 + x.
The CPU datasheet usually establishes the convention used to give the addresses of the devices and the memory map of the CPU.
Address spaces can be hierarchical (e.g. as in PCI) or windowed (e.g. the legacy PCI config space on x86); they can have aliases, require special instructions, or overlap (e.g. reads go to ROM, writes go to RAM).
Always refer to the CPU manual/datasheet to understand the CPU memory map and how its address range(s) are routed.
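As a toy illustration of the example above, here is a bare-metal C sketch of forming the translated address, assuming the made-up 128/128 split and a device register at IO-relative address 100; the names and addresses are invented for the example and only make sense on a target with that memory map.

#include <stdint.h>

#define IO_BASE        128u   /* assumed: IO space starts at CPU address 128 */
#define DEV_REG_OFFSET 100u   /* device register address, relative to IO space */

/* Store a value to the device register: the CPU-visible address is
 * IO_BASE + DEV_REG_OFFSET, i.e. 228 in this made-up memory map. */
static inline void dev_reg_write(uint8_t value)
{
    volatile uint8_t *reg =
        (volatile uint8_t *)(uintptr_t)(IO_BASE + DEV_REG_OFFSET);
    *reg = value;   /* volatile so the compiler really issues the store */
}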

What can be a reason for PCI MMIO read data delay?

Our team is currently working on a custom device.
There is a Cyclone V board with a COM Express amd64-based PC plugged into it. The board works as a PCIe native endpoint. It powers on first, and then it powers the PC on. The PC runs Linux with kernel 4.10, plus some drivers and software that work with PCI BAR0 via MMIO.
The system works flawlessly until the first reboot from the Linux terminal. On the next boot, MMIO read access is broken, while MMIO write access is OK.
Let's say there are two offsets to read, A and B, with values 0xa and 0xb respectively. If we read bytes from these offsets, it looks as if the returned values are delayed by 8 read operations:
read A ten times - returns 0xa every time
read B eight times - returns 0xa every time
read B ten times - returns 0xb every time
read A once - returns 0xb
read B seven times - returns 0xb every time
read B once - returns 0xa
read B ten times - returns 0xb
If offsets A and B are within the same 64-bit word, everything works as expected.
MMIO access is done via the readb/readw/readl/readq functions; the actual function used does not affect this delay at all.
Subsequent reboots may fix or break MMIO reads again.
From the Linux point of view, mmiotrace shows the same picture with broken data.
From the device's point of view, the SignalTap logic analyzer shows valid data values on the PCIe core bus.
We have no PCI bus analyzer, so we have no way to check the data exchange between those two points.
What can be the reason for such behaviour, and how can it be fixed?
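For reference, a minimal sketch of how such MMIO reads would typically be issued from a Linux driver; the offsets and function name are hypothetical, and bar0 is assumed to come from pci_iomap(pdev, 0, 0) at probe time.

#include <linux/kernel.h>
#include <linux/io.h>

#define OFFSET_A 0x00   /* hypothetical BAR0 offset holding 0xa */
#define OFFSET_B 0x08   /* hypothetical BAR0 offset holding 0xb */

static void dump_reads(void __iomem *bar0)
{
    int i;

    for (i = 0; i < 10; i++)
        pr_info("A[%d] = 0x%02x\n", i, readb(bar0 + OFFSET_A));
    for (i = 0; i < 10; i++)
        pr_info("B[%d] = 0x%02x\n", i, readb(bar0 + OFFSET_B));
}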

Address Translation in a Bus hierarchy

Consider a bus hierarchy in which a bus A connects to another bus B, and bus B in turn connects to two other buses, C and D. ab, bc and bd are the corresponding bridges. Each of these buses also has its own devices attached.
A<-ab->B
B<-bc->C
B<-bd->D
I understand that, depending on its application, a bus may have its own speed and address-range capacity. I want to focus on the address space of each of these buses. Since A is the host to all buses down the hierarchy, its address range should be wide enough to uniquely identify each device in the hierarchy.
My understanding is that, in general, a device on bus C may have a numerically identical (or overlapping) bus-local address range to that of a device on bus D. However, when these address ranges are mapped up to bus B, they are mapped to two different address ranges. That is, a device C.c may be assigned the local address range 0x000 - 0xfff and a device D.d may also have the same local address range 0x000 - 0xfff, but on bus B they may map to something like C.c at 0x0000 - 0x0fff and D.d at 0xaaaa - 0xeb3f. Though this mapping is very much platform-specific, I wanted to understand whether this understanding is correct in general.
Another point I have been assuming all along is that the bridge performs this address translation when data crosses it in either direction. Please let me know if this is correct.
My other question is: if the bridge performs this translation, when is the bridge populated with the translation table? Is there any role for a bridge controller driver, and how does it do that (if at all)?
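Speaking generically (not about PCI specifically, where bridges usually forward memory addresses within their windows unchanged), a bridge that remaps a downstream window can be modeled along these lines; the structure and function names are invented for illustration, and the base addresses echo the example in the question.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* A bridge window: [child_base, child_base + size) on the downstream bus
 * appears as [parent_base, parent_base + size) on the upstream bus. */
struct bridge_window {
    uint64_t child_base;
    uint64_t parent_base;
    uint64_t size;
};

/* Translate a downstream (bus-local) address into an upstream address.
 * Returns false if the bridge does not claim the address. */
static bool bridge_to_parent(const struct bridge_window *w,
                             uint64_t child_addr, uint64_t *parent_addr)
{
    if (child_addr < w->child_base || child_addr - w->child_base >= w->size)
        return false;
    *parent_addr = w->parent_base + (child_addr - w->child_base);
    return true;
}

int main(void)
{
    /* C.c and D.d both use local range 0x000-0xfff, but the bridges bc and bd
     * expose them at different base addresses on bus B. */
    struct bridge_window bc = { 0x000, 0x0000, 0x1000 };
    struct bridge_window bd = { 0x000, 0xaaaa, 0x1000 };
    uint64_t on_b;

    if (bridge_to_parent(&bc, 0x123, &on_b))
        printf("C.c local 0x123 appears on bus B at 0x%llx\n",
               (unsigned long long)on_b);
    if (bridge_to_parent(&bd, 0x123, &on_b))
        printf("D.d local 0x123 appears on bus B at 0x%llx\n",
               (unsigned long long)on_b);
    return 0;
}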

How does a computer really request data?

I was wondering how exactly a CPU requests data in a computer. In a 32-bit architecture, I thought that a computer would put a destination on the address bus and would receive 4 bytes on the data bus. I recently read about memory alignment and it confused me. I read that the CPU has to read memory twice to access an address that is not a multiple of 4. Why is that? The address bus lets it access addresses that are not multiples of 4.
The address bus itself, even in a 32-bit architecture, is usually not 32 bits in size. E.g. the Pentium's address bus was 29 bits. Given that it has a full 32-bit range, in the Pentium's case that means each slot in memory is eight bytes wide. So reading a value that straddles two of those slots means two reads rather than one, and alignment prevents that from happening.
Other processors (including other implementations of the 32-bit Intel architecture) have different memory access word sizes but the point is generic.
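A small sketch of the arithmetic behind that point; the 8-byte slot width follows the Pentium example above, and the addresses are arbitrary.

#include <stdio.h>

/* Number of bus transactions needed for an access of access_size bytes at
 * address addr, if the bus transfers aligned slots of slot_size bytes. */
static unsigned int slots_touched(unsigned long addr, unsigned int access_size,
                                  unsigned int slot_size)
{
    unsigned long first = addr / slot_size;
    unsigned long last  = (addr + access_size - 1) / slot_size;
    return (unsigned int)(last - first + 1);
}

int main(void)
{
    /* 8-byte slots, as in the Pentium example above. */
    printf("4-byte read at 0x1000: %u transaction(s)\n", slots_touched(0x1000, 4, 8));
    printf("4-byte read at 0x1006: %u transaction(s)\n", slots_touched(0x1006, 4, 8));
    return 0;
}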

Linux block driver merge bios

I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than using a request queue, as the device has no seek time. However, it still has transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (so 2 * 2k) and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. However on a write, the bios each have just one segment of 2 sectors and therefore the operations take a lot longer in total. What I would like to happen is to somehow cause the incoming bios to consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of the following (see the sketch after this list):
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist* with sg_set_page
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
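A rough sketch of those steps, assuming a ~3.x kernel where bio_for_each_segment() yields struct bio_vec pointers and bio_endio() takes an error argument; struct mydev and mydev_queue_sg() stand in for the driver's own device structure and hardware-specific descriptor setup, and a real driver would preallocate the scatterlist rather than keep it on the stack.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/pci.h>
#include <linux/scatterlist.h>

static void mydev_make_req(struct request_queue *q, struct bio *bio)
{
    struct mydev *dev = q->queuedata;
    struct scatterlist sgl[MYDEV_BLOCK_MAX_SEGS];
    struct scatterlist *sg;
    struct bio_vec *bvec;
    int dir = bio_data_dir(bio) == WRITE ? PCI_DMA_TODEVICE : PCI_DMA_FROMDEVICE;
    int i, nsegs = 0, mapped, err = 0;

    sg_init_table(sgl, MYDEV_BLOCK_MAX_SEGS);

    /* One scatterlist entry per segment of the bio. */
    bio_for_each_segment(bvec, bio, i)
        sg_set_page(&sgl[nsegs++], bvec->bv_page, bvec->bv_len, bvec->bv_offset);

    /* Map the whole list for DMA in one call. */
    mapped = pci_map_sg(dev->pdev, sgl, nsegs, dir);

    /* Build the device-specific multi-segment descriptor and run the
     * transfer; mydev_queue_sg() is the hypothetical hardware-specific part. */
    for_each_sg(sgl, sg, mapped, i)
        if (mydev_queue_sg(dev, sg_dma_address(sg), sg_dma_len(sg), dir))
            err = -EIO;

    pci_unmap_sg(dev->pdev, sgl, nsegs, dir);

    bio_endio(bio, err);
}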
The request queue is set up like:
#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE 2048

/* Bypass the request queue and handle bios directly. */
blk_queue_make_request(mydev->queue, mydev_make_req);
/* Non-rotational device: no benefit from elevator-style reordering. */
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
/* 2 KiB hardware sectors, both physical and logical. */
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
/* No cache-flush or FUA support advertised. */
blk_queue_flush(mydev->queue, 0);
/* No segment boundary restriction. */
blk_queue_segment_boundary(mydev->queue, -1UL);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
/* DMA alignment mask 0x7, i.e. buffers aligned to 8 bytes. */
blk_queue_dma_alignment(mydev->queue, 0x7);

Resources