Address Translation in a Bus hierarchy - linux-kernel

Consider a bus hierarchy in which bus A connects to another bus B, and bus B in turn connects to two further buses, C and D. ab, bc and bd are the corresponding bridges. Each of these buses also has its own devices attached.
A<-ab->B
B<-bc->C
B<-bd->D
I understand that, depending on its application, a bus may have its own speed and address range. I want to focus on the address space of each of these buses. Since A is the host to all buses further down the hierarchy, its address range should be wide enough to uniquely identify each device in the hierarchy.
My understanding is that, in general, a device on bus C may have a bus-local address range that is numerically the same as (or overlaps) that of a device on bus D. However, when these address ranges are mapped onto the upper bus B, they are mapped to two different address ranges. That is, a device C.c may be assigned the local address range 0x000 - 0xfff and a device D.d the same local address range 0x000 - 0xfff, but on bus B they may map to something like C.c (0x0000 - 0x0fff) and D.d (0xaaaa - 0xeb3f). Though this mapping is very much platform specific, I wanted to check whether this understanding is correct in general.
Another point that I have been assuming all along is that the bridge performs this function of address translation when data crosses in either direction. Please let me know if this is correct.
My other question is: if the bridge performs this translation, when is the bridge populated with the translation table? Does the bridge controller driver play any role? How does it do that (if at all)?
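To make the mapping described above concrete, here is a minimal C sketch. It assumes a bridge translates by adding a fixed offset to addresses that fall inside a window; real bridges may work differently. The struct `bridge_window`, both functions, and the window bases (0x0000 for C.c, 0xa000 for D.d) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one bridge window: a range of child-bus
 * addresses and the parent-bus address that range is mapped to. */
struct bridge_window {
    uint32_t child_base;   /* first address as seen on the child bus  */
    uint32_t parent_base;  /* same location as seen on the parent bus */
    uint32_t size;         /* window length in bytes                  */
};

/* Illustrative windows only: C.c and D.d both use local 0x000 - 0xfff,
 * but appear at different places on bus B. */
const struct bridge_window bc_window = { 0x000, 0x0000, 0x1000 };
const struct bridge_window bd_window = { 0x000, 0xa000, 0x1000 };

/* Child -> parent translation. Returns false when the address lies
 * outside the window, i.e. this bridge would not forward it. */
bool bridge_to_parent(const struct bridge_window *w,
                      uint32_t child_addr, uint32_t *parent_addr)
{
    assert(w != NULL && parent_addr != NULL);
    if (child_addr < w->child_base || child_addr - w->child_base >= w->size)
        return false;
    *parent_addr = w->parent_base + (child_addr - w->child_base);
    return true;
}

/* Parent -> child translation, the inverse of the mapping above. */
bool bridge_to_child(const struct bridge_window *w,
                     uint32_t parent_addr, uint32_t *child_addr)
{
    assert(w != NULL && child_addr != NULL);
    if (parent_addr < w->parent_base || parent_addr - w->parent_base >= w->size)
        return false;
    *child_addr = w->child_base + (parent_addr - w->parent_base);
    return true;
}
```

With these windows, local address 0x010 on bus D shows up as 0xa010 on bus B, while the same local address on bus C shows up as 0x0010, so the two devices no longer collide.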

Related

Are there practical use cases for one PCIe Physical Function (PF) mapping to hundreds or even thousands of Virtual Functions (VFs)?

It's known that a single PF can map to multiple VFs.
About the number of VFs associated with a single PF:
In PCIe 5.0 spec:
IMPLEMENTATION NOTE
VFs Spanning Multiple Bus Numbers
As an example, consider an SR-IOV Device that supports a single PF. Initially, only PF 0 is visible. Software Sets ARI Capable Hierarchy. From the SR-IOV Extended Capability it determines: InitialVFs is 600, First VF Offset is 1 and VF Stride is 1.
If software sets NumVFs in the range [0 … 255], then the Device uses a single Bus Number.
If software sets NumVFs in the range [256 … 511], then the Device uses two Bus Numbers.
If software sets NumVFs in the range [512 … 600], then the Device uses three Bus Numbers.
PF 0 and VF 0,1 through VF 0,255 are always on the first (captured) Bus Number. VF 0,256 through VF 0,511 are always on the second Bus Number (captured Bus Number plus 1). VF 0,512 through VF 0,600 are always on the third Bus Number (captured Bus Number plus 2).
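The bus-number arithmetic in this note can be reproduced with a short C sketch. It assumes, as I understand the SR-IOV capability, that VF n (counting from 1) has Routing ID = PF Routing ID + First VF Offset + (n - 1) * VF Stride, and that in an ARI hierarchy the upper 8 bits of the 16-bit Routing ID are the bus number. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Routing ID of VF number vf_index (1-based).
 * Layout assumed: bus in bits [15:8], ARI function number in bits [7:0]. */
uint16_t vf_routing_id(uint16_t pf_rid, uint16_t first_vf_offset,
                       uint16_t vf_stride, unsigned vf_index)
{
    assert(vf_index >= 1);
    return (uint16_t)(pf_rid + first_vf_offset + (vf_index - 1) * vf_stride);
}

/* Number of bus numbers consumed by the PF plus num_vfs enabled VFs. */
unsigned buses_used(uint16_t pf_rid, uint16_t first_vf_offset,
                    uint16_t vf_stride, unsigned num_vfs)
{
    unsigned first_bus = pf_rid >> 8;
    unsigned last_bus;

    if (num_vfs == 0)
        return 1; /* only the PF itself is present */
    last_bus = vf_routing_id(pf_rid, first_vf_offset, vf_stride, num_vfs) >> 8;
    return last_bus - first_bus + 1;
}
```

For a PF at function 0 with First VF Offset 1 and VF Stride 1, this gives one bus for up to 255 VFs, two buses for 256 to 511, and three buses for 512 to 600, matching the ranges quoted above.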
From Oracle:
Each SR-IOV device can have a physical function and each physical function can have up to 64,000 virtual functions associated with it.
From the "sharing PCIe I/O bandwidth" point of view, having hundreds or thousands of VFs associated with a single PF might be understandable: each VF is assigned to a VM, on the assumption that most of the VFs are idle at any given point in time.
However, from the "chip manufacturing" point of view, for a non-trivial PCIe function, duplicating the VF part of the IP hundreds or thousands of times within a single die would make the die area too large to be practical.
So my question is, as stated in the subject line: are there practical use cases for having so many VFs associated with a single PF?

Does memory-mapped I/O work by using RAM addresses?

Imagine a processor capable of addressing an 8-bit range (I know this is ridiculously small in reality) with a 128 byte RAM. And there is some 8-bit device register mapped to address 100. In order to store a value to it, does the CPU need to store a value at address 100 or does it specifically need to store a value at address 100 within RAM? In pseudo-assembly:
STI 100, value
VS
STI RAM_start+100, value
Usually, the address of a device is specified relative to the start of the address space it lives in.
The datasheet will surely give more context and clarify whether the address is relative to something else.
However, before using it you have to translate that address into the one the CPU would see.
For example, if your 8-bit address range accessible with the sti instruction is split in half:
0-127 => RAM
128-255 => IO
Because the hardware is wired this way, then, as seen from the CPU, the IO address range starts at 128, so an IO address of x is accessible at 128 + x.
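The translation for this example map can be sketched in C as follows. This only illustrates the 0-127 RAM / 128-255 IO split used here; `IO_WINDOW_BASE`, `IO_WINDOW_SIZE` and `io_to_cpu_address` are made-up names, not something from a real datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the example map: 0-127 => RAM, 128-255 => IO. */
#define IO_WINDOW_BASE 128u
#define IO_WINDOW_SIZE 128u

/* Convert an address relative to the IO space into the address the CPU
 * has to put on its bus to reach the same register. */
uint8_t io_to_cpu_address(uint8_t io_addr)
{
    assert(io_addr < IO_WINDOW_SIZE);
    return (uint8_t)(IO_WINDOW_BASE + io_addr);
}
```

Under this map, a device register documented at IO address 100 would be reached by a store to CPU address 228, not to address 100 (which is RAM).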
The CPU datasheet usually establishes the convention used to give the addresses of the devices and the memory map of the CPU.
Address spaces can be hierarchical (e.g. as in PCI) or windowed (e.g. like the legacy PCI config space on x86), can have aliases, may require special instructions, or may overlap (e.g. reads go to ROM, writes go to RAM).
Always refer to the CPU manual/datasheet to understand the CPU memory map and how its address range(s) is (are) routed.

What are the advantages of 16-bit addressing on IEEE 802.15.4 networks?

The only advantage I can think of in using 16-bit instead of 64-bit addressing on an IEEE 802.15.4 network is that 6 bytes are saved in each frame. There might be a small win for memory-constrained devices as well (microcontrollers), especially if they need to keep a list of many addresses.
But there are a couple of drawbacks:
A coordinator must be present to deal out short addresses
Big risk of conflicting addresses
A device might be assigned a new address without other nodes knowing
Are there any other advantages of short addressing that I'm missing?
You are correct in your reasoning, it saves 6 bytes which is a non-trivial amount given the packet size limit. This is also done with PanId vs ExtendedPanId addressing.
You are inaccurate about some of your other points though:
The coordinator does not assign short addresses. A device randomly picks one when it joins the network.
Yes, there is a 1/65000 or so chance for a collision. When this happens, BOTH devices pick a new short address and notify the network that there was an address conflict. (In practice I've seen this happen all of twice in 6 years)
This is why the binding mechanism exists. You create a binding using the 64-bit address. When transmission fails to a short address, the 64-bit address can be used to relocate the target node and correct the routing.
The short (16-bit) and simple (8-bit) addressing modes and the PAN ID Compression option allow a considerable saving of bytes in any 802.15.4 frame. You are correct that these savings are a small win for the memory-constrained devices that 802.15.4 is designed to work on; however, the main goal of these savings is to reduce radio usage.
The original design goals for 802.15.4 were along the lines of 10 metre links, 250kbit/s, low-cost, battery operated devices.
The maximum frame length in 802.15.4 is 127 bytes. The "full" addressing modes in 802.15.4 consist of a 16-bit PAN ID and a 64-bit Extended Address for both the transmitter and receiver. This amounts to 20 bytes or about 16% of the available bytes in the frame. If these long addresses had to be used all of the time there would be a significant impact on the amount of application data that could be sent in any frame AND on the energy used to operate the radio transceivers in both Tx and Rx.
The 802.15.4 MAC layer defines an association process that can be used to negotiate and use shorter addressing mechanisms. The addressing that is typically used is a single 16-bit PAN ID and two 16-bit Short Ids, which amounts to 6 bytes or about 5% of the available bytes.
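The byte counts in the two paragraphs above can be checked with a small C sketch. It only does the arithmetic described in the text (2 bytes per PAN ID, plus one source and one destination address); the function names are hypothetical and the frame size is passed in rather than hard-coded.

```c
#include <assert.h>

/* Bytes spent on addressing fields in one frame.
 * pan_ids:         number of 16-bit PAN IDs carried (2 without PAN ID
 *                  compression, 1 with it)
 * addr_bytes_each: size of each address, 8 for extended, 2 for short */
unsigned addressing_bytes(unsigned pan_ids, unsigned addr_bytes_each)
{
    assert(pan_ids >= 1 && pan_ids <= 2);
    /* one source address and one destination address */
    return pan_ids * 2u + 2u * addr_bytes_each;
}

/* Share of the frame taken by addressing, in percent. */
double addressing_share_percent(unsigned addr_bytes, unsigned frame_bytes)
{
    assert(frame_bytes > 0);
    return 100.0 * (double)addr_bytes / (double)frame_bytes;
}
```

Here `addressing_bytes(2, 8)` gives the 20 bytes of the full addressing mode and `addressing_bytes(1, 2)` gives the 6 bytes of the typical short mode, a saving of 14 bytes per frame.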
On your list of drawbacks:
Yes, a coordinator must hand out short addresses. How the addresses are created and allocated is not specified but the MAC layer does have mechanisms for notifying the layers above it that there are conflicts.
The risk of conflicts is not large, as there are 65534 possible addresses to be handed out and 802.15.4 is only concerned with "Layer 2" links (NB: 0xFFFF and 0xFFFE are special values). These addresses are not routable/routing/internetworking addresses (well, not from 802.15.4's perspective).
Yes, I guess a device might get a new address without the other nodes knowing but I have a hunch this question has more to do with ZigBee's addressing than with the 802.15.4 MAC addressing. Unfortunately I do not know much about ZigBee's addressing so I can't comment too much here.
I think it is important for me to point out that 802.15.4 is a layer 1 and layer 2 specification and ZigBee is layer 3 and up, i.e. ZigBee sits on top of 802.15.4.
This table is not 100% accurate, but I find it useful to think of 802.15.4 in this context:
+---------------+------------------+------------+
| Application   | HTTP / FTP / Etc | COAP / Etc |
+---------------+------------------+------------+
| Transport     | TCP / UDP        |            |
+---------------+------------------+   ZigBee   |
| Network       | IP               |            |
+---------------+------------------+------------+
| Link / MAC    | WiFi / Ethernet  | 802.15.4   |
|               | Radio            | Radio      |
+---------------+------------------+------------+

How does a CPU really request data in a computer?

I was wondering how exactly a CPU requests data in a computer. In a 32-bit architecture, I thought that the CPU would put a destination on the address bus and receive 4 bytes on the data bus. I recently read about memory alignment and it confused me. I read that the CPU has to read memory twice to access an address that is not a multiple of 4. Why is that? The address bus lets it access addresses that are not multiples of 4.
The address bus itself, even in a 32-bit architecture, is usually not 32 bits in size. E.g. the Pentium's address bus was 29 bits. Given that it has a full 32-bit range, in the Pentium's case that means each slot in memory is eight bytes wide. So reading a value that straddles two of those slots means two reads rather than one, and alignment prevents that from happening.
Other processors (including other implementations of the 32-bit Intel architecture) have different memory access word sizes but the point is generic.
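The effect can be sketched in a few lines of C. This is only a model of the slot idea described above (memory accessed in aligned slots of a fixed width), not a description of any particular processor's bus protocol; `bus_reads_needed` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bus reads needed to fetch `size` bytes starting at `addr`
 * when memory is accessed in aligned slots of `slot_bytes` bytes
 * (8 for the Pentium example, 4 for a plain 32-bit data bus). */
unsigned bus_reads_needed(uint32_t addr, uint32_t size, uint32_t slot_bytes)
{
    uint64_t first_slot, last_slot;

    assert(size > 0 && slot_bytes > 0);
    /* 64-bit arithmetic so that addr + size cannot wrap around */
    first_slot = (uint64_t)addr / slot_bytes;
    last_slot = ((uint64_t)addr + size - 1) / slot_bytes;
    return (unsigned)(last_slot - first_slot + 1);
}
```

A 4-byte value at 0x1004 sits inside one 8-byte slot and needs a single read, whereas the same value at 0x1006 straddles two slots and needs two.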

Linux block driver: merging bios

I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than use a request queue, as the device has no seek time. However, it still has transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (so 2 * 2k) and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. However on a write, the bios each have just one segment of 2 sectors and therefore the operations take a lot longer in total. What I would like to happen is to somehow cause the incoming bios to consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of:
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist* with sg_set_page
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
The request queue is set up like:
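#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE 2048
/* Handle bios directly in mydev_make_req instead of using a request queue */
blk_queue_make_request(mydev->queue, mydev_make_req);
/* Non-rotational device: no seek penalty */
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);
/* Limit scatter-gather lists to MYDEV_BLOCK_MAX_SEGS segments */
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
/* 2 KiB hardware sectors */
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
/* No flush/FUA support advertised */
blk_queue_flush(mydev->queue, 0);
/* No segment boundary restriction */
blk_queue_segment_boundary(mydev->queue, -1UL);
/* DMA buffers must be 8-byte aligned (mask 0x7) */
blk_queue_dma_alignment(mydev->queue, 0x7);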
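To illustrate what "merging bios sensibly" could mean, here is a plain userspace C model, not kernel code and not a block layer API. It assumes single-segment writes arrive in order and groups those that are contiguous on the device into batches of at most 32 segments; `struct model_bio`, `MODEL_MAX_SEGS` and `batch_contiguous` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_SEGS 32 /* mirrors MYDEV_BLOCK_MAX_SEGS above */

/* Simplified stand-in for a single-segment bio: start sector and length. */
struct model_bio {
    uint64_t sector;
    uint32_t nr_sectors;
};

/* Group consecutive entries of bios[] into batches. An entry joins the
 * current batch only if it starts exactly where the previous entry ended
 * and the batch still has room. batch_len[] must hold n entries; on
 * return batch_len[i] is the number of segments in batch i.
 * Returns the number of batches. */
size_t batch_contiguous(const struct model_bio *bios, size_t n,
                        size_t *batch_len)
{
    size_t batches = 0;

    assert(n == 0 || (bios != NULL && batch_len != NULL));
    for (size_t i = 0; i < n; i++) {
        /* batches > 0 implies i >= 1, so bios[i - 1] is valid here */
        int extends = batches > 0 &&
                      batch_len[batches - 1] < MODEL_MAX_SEGS &&
                      bios[i - 1].sector + bios[i - 1].nr_sectors ==
                          bios[i].sector;
        if (extends)
            batch_len[batches - 1]++;
        else
            batch_len[batches++] = 1;
    }
    return batches;
}
```

In this model each batch would correspond to one scatter-gather transaction to the device. Whether such batching is best done in the driver or left to the block layer is exactly the open question here.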

Resources