How does x86 assign interrupt numbers to PCI devices in Linux? - linux-kernel

My understanding is BIOS or EFI detects the hardware during bootup and determines interrupt number, then passes it to Linux once kernel is up and running. And based on my research the lower the interrupt number the higher its priority.
My question is how does BIOS/EFI decide which hardware should have high priority over another? Is it something that is configurable or is hardcoded by BIOS/EFI?

Kind of.
When using the legacy 8259A PIC chip, one of the priority modes is based on the IRQ number - with lower IRQs having more priority.
However with the IO APIC and the MSI(X) technology the IRQ priority is handled in the LAPIC and it is configurable by the OS.
For the legacy scenario, these devices have fixed IRQs (not configurable).
The priority was assigned so that important/frequent tasks could interrupt less important/frequent ones.
Today those devices are emulated and their IRQs can be reassigned if needed (in some cases; it depends on the chipset/Super I/O/embedded controller), but that could cause compatibility issues.
So every device that impersonates a legacy one (e.g. an HDD) is usually assigned its legacy IRQ number.
A different topic is the PCI interrupts (PCIe deprecated the INTx# lines in favour of MSI) for non legacy devices (e.g. a NIC).
Those were (are) the real programmable IRQs: each PCI-to-PCI bridge remaps its four PIRQA-PIRQD input pins to its four INTA#-INTD# output pins (which are connected to the PIRQA-PIRQD pins of the bridge's parent in a tangled fashion).
The Host-to-PCI-bridge INTA#-INTD# connects (conceptually) to the 8259A and the IO-APIC.
The mapping is configurable with some chipset registers (e.g. see Chapter 29 of the Intel Series 200 PCH datasheet Volume 2).
So the firmware is free to remap at least the PCI interrupts for non-legacy devices. I think the algorithm used is simply to assign the lowest free IRQ to the most "important" device.
However, as said above, as soon as the OS switches away from the 8259A mode these priorities stop mattering.
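The "tangled" wiring between a bridge and its parent is usually described as a swizzle. Here is a minimal sketch, assuming the conventional rule for PCI-to-PCI bridges (the same arithmetic Linux's pci_swizzle_interrupt_pin() performs): the pin seen on the parent side is the device's own pin rotated by its device number, modulo 4. The function name and the PIN_NAMES table are illustrative only.

```python
PIN_NAMES = {1: "INTA#", 2: "INTB#", 3: "INTC#", 4: "INTD#"}


def swizzle_intx(device: int, pin: int) -> int:
    """Return the INTx pin (1..4) that a device's pin appears as on the parent bus."""
    if not 0 <= device <= 31:
        raise ValueError("device number must be in 0..31")
    if pin not in PIN_NAMES:
        raise ValueError("pin must be 1 (INTA#) .. 4 (INTD#)")
    # Rotate the pin by the device number, wrapping around the four INTx lines.
    return ((pin - 1) + device) % 4 + 1


if __name__ == "__main__":
    # INTA# of devices 0..3 behind one bridge lands on four different parent pins.
    for device in range(4):
        print(device, PIN_NAMES[swizzle_intx(device, 1)])
```

The rotation spreads single-function devices (which normally use INTA#) across all four parent lines instead of piling them onto one.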

Related

Which instructions does a CPU use to communicate with PCIe cards?

I want to understand how a CPU works and so I want to know how it communicates with a PCIe card.
Which instructions does the CPU use to initialize a PCIe port and then read and write to it?
For example OUT or MOV.
A CPU mainly communicates with PCIe cards through memory ranges they expose. This memory may be small for network or sound cards, and very large for graphics cards. Integrated GPUs also have their own tiny memory but share most of the main memory. Most other cards also have read/write access to main memory.
To set up the PCIe device, the configuration space is written to. On x86, the BIOS or bootloader will provide the location of this data. PCI devices are connected in a tree, which may include hubs and bridges on larger computers; the tree can be shown with lspci -t. Thunderbolt can even connect to external devices. This is why the OS needs to recursively "probe" the tree to find PCI devices and configure them.
Synchronization uses interrupts and ring buffers. The device can send a prenegotiated interrupt to the CPU when it's done doing work. The CPU writes work to a ring buffer. It then writes another memory location that contains the head pointer. This memory location is located on the device so it can listen to writes there and wake up when there is work to do.
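The ring-buffer handshake can be modelled in a few lines. This is only a toy simulation under stated assumptions: Ring, FakeDevice and write_doorbell are hypothetical names, the "device" runs synchronously inside the doorbell write, and a counter stands in for the completion interrupt. Real hardware works asynchronously and uses DMA to fetch the entries.

```python
class Ring:
    """Fixed-size ring of work items living in main memory (hypothetical layout)."""

    def __init__(self, size: int):
        self.slots = [None] * size
        self.size = size
        self.head = 0  # next slot the CPU fills
        self.tail = 0  # next slot the device consumes

    def push(self, work) -> bool:
        nxt = (self.head + 1) % self.size
        if nxt == self.tail:
            return False  # full: one slot stays empty to tell full from empty
        self.slots[self.head] = work
        self.head = nxt
        return True


class FakeDevice:
    """Stand-in for a card that watches writes to its head-pointer register."""

    def __init__(self, ring: Ring):
        self.ring = ring
        self.done = []
        self.interrupts_raised = 0

    def write_doorbell(self, new_head: int) -> None:
        """Models the CPU's store of the head pointer into device memory."""
        while self.ring.tail != new_head:
            self.done.append(self.ring.slots[self.ring.tail])
            self.ring.tail = (self.ring.tail + 1) % self.ring.size
        self.interrupts_raised += 1  # stands in for the completion interrupt


if __name__ == "__main__":
    ring = Ring(4)
    device = FakeDevice(ring)
    ring.push("packet 0")
    ring.push("packet 1")
    device.write_doorbell(ring.head)
    print(device.done, device.interrupts_raised)
```

Batching several pushes before one doorbell write is the point of the design: the expensive store to device memory happens once per batch, not once per work item.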
Most of the interaction for modern devices will use MOV instead of OUT. The I/O ports concept is very old and not very suitable for the massive amount of data on modern systems. Having devices expose their functionality as a type of memory instead of a separate mechanism allows vectorized variants of MOV to move 32 bytes or similar at a time. With graphics card and modern network cards supporting offload, they can also use their own hardware to write results back to main memory when instructed to do so. The CPU can then read the results when it's free later, again using MOV.
Before this memory access works, the OS will need to set up the memory mapping properly. The memory mapping is set in the PCI configuration space as BARs. On the CPU side it is set up in the page tables. CPUs usually have caches to keep data locally because access to RAM is slower. This causes a problem when the data needs to get to a PCI device, so the OS will set certain memory as write-through or even uncacheable so this is ensured.
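To make the BAR part concrete, here is a sketch of how the raw value of a BAR register is commonly decoded, assuming the standard PCI layout (bit 0 selects I/O vs memory space; for memory BARs, bits 2:1 give the type and bit 3 the prefetchable flag). The function name and the returned dictionary are made up for illustration; sizing a BAR (writing all ones and reading back) is not shown.

```python
def decode_bar(low: int, high: int = 0) -> dict:
    """Decode a BAR register value; `high` is the next BAR, used only for 64-bit BARs."""
    if low & 0x1:
        # I/O space BAR: the low two bits are flags, the rest is the port base.
        return {"space": "io", "address": low & 0xFFFFFFFC}
    bar_type = (low >> 1) & 0x3
    is_64bit = bar_type == 0b10
    address = low & 0xFFFFFFF0  # the low four bits are flags
    if is_64bit:
        address |= high << 32  # upper half lives in the following BAR
    return {
        "space": "memory",
        "address": address,
        "is_64bit": is_64bit,
        "prefetchable": bool(low & 0x8),
    }


if __name__ == "__main__":
    print(decode_bar(0xF7000000))
    print(decode_bar(0xFE00000C, 0x1))
    print(decode_bar(0xE001))
```

The decoded address is a physical address; the OS still has to map it in the page tables with the right cacheability before MOV can reach the device.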
The term BAR shows up in GPU vendors' marketing (as "Resizable BAR"). What they are selling is the ability to map a larger region of the card's memory at a time. Without that, OSes have had to keep remapping a limited window onto that memory. This exemplifies the importance of MOV for accessing PCIe devices.

Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)?

What really happens on modern PC hardware booted in 16-bit legacy BIOS MBR mode when you store a byte such as '1' (0x31) into the VGA text (mode 03) framebuffer at physical linear address B8000? How slow is a mov [es:di], eax store with the MTRR for that region set to UC? (Experimental testing on one Kaby Lake iGPU laptop indicates that clflushopt on WC was roughly the same speed as UC for VGA memory. But without clflushopt, mov stores to WC memory never leave the CPU and don't update the screen at all, running super fast.)
If it's not an SMI for every store, is there any way to approximate this cost on a chunk of WB memory in user-space, for performance experiments without actually rebooting into real mode? (e.g. using a BSS page as a pretend framebuffer that doesn't actually display anywhere).
The corresponding font glyph appears on screen in the next refresh, but is hardware scan-out really reading that ASCII char from VRAM (or DRAM for an iGPU) and mapping to bitmap font glyphs on the fly? Or is there some software interception on each store or once per vblank so the real hardware only has to handle a bitmapped framebuffer?
Legacy BIOS booting is well known to use System Management Mode (SMM) to emulate USB kbd/mouse as PS/2 devices. I'm wondering if it's also used for the VGA text mode framebuffer. I assume it is used for VGA I/O ports for mode-setting but it's plausible that a text framebuffer could be supported by hardware. However, most computers spend all their time in graphics mode so leaving out HW support for text mode seems like something vendors might want to do. (OTOH this blog suggests that a homebrew verilog VGA controller can implement text mode fairly simply.)
I'm specifically interested in systems using the iGPU in Intel Skylake, but would be interested in earlier / later iGPUs from Intel and AMD, and new or old discrete GPUs.
(Including vendors other than AMD and NVidia; there are some Skylake motherboards with PCI slots, not PCIe. If modern GPU firmware drivers do emulate text mode, presumably there are some old PCI video cards with hardware VGA text mode. And maybe such a card could make stores just be a PCI transaction instead of an SMI.)
My own desktop is an i7-6700k in an Asus Z170 Pro Gaming mobo, no add-on cards just iGPU with a 1920x1200 monitor on the DVI-D output. I don't know the details of the Kaby Lake i5-7300HQ system @Eldan is testing on, only the CPU model.
I found Phoenix BIOS's patent US20120159520 from 2011, Emulating legacy video using uefi. Instead of requiring video hardware vendors to supply both UEFI and native 16-bit real mode option-ROM drivers, they propose a real-mode VGA driver (int 10h functions and so on) that calls a vendor-supplied UEFI video driver via SMM hooks.
Abstract
[...] The generic video option ROM notifies a generic video SMM driver of the request for video services. Such notification may be performed using a software system management interrupt (SMI). Upon notification, the generic video SMM driver notifies a third party UEFI video driver of the request for video services. The third party video driver provides the requested video services to the operating system. In this way, a third party UEFI graphics driver may support a wide variety of operating systems, even those that do not natively support the UEFI display protocols.
Much of the description covers handling int 10h calls and stuff like that which already obviously trap through the IVT, thus can easily run custom code that triggers an SMI on purpose. The relevant part is what they describe for direct stores into the text-mode framebuffer which need to work even for code that doesn't trigger any software or hardware interrupts. (Other than HW triggering SMI on such stores, which they say they can use if supported.)
Text Buffer Support
[0066] In certain embodiments, applications may manipulate the VGA's
text buffer directly. In such an embodiment, generic video SMM driver
130 support this in one of two ways, depending on whether the hardware
provides SMI trapping on read/write access to the 740 KB-768 KB memory
region (where the text buffers are located).
[0067] When SMI trapping is available, the hardware generates an SMI
on each read or write access. Using the trap address of the SMI trap,
the exact text column and row may be calculated and the corresponding
row and column in the virtual text screen accessed.
Alternately,
normal memory is enabled for this region and, using a periodic SMI,
generic video SMM driver 130 scans for changes in the emulated
hardware text buffer and updates the corresponding virtual text screen
maintained by the video driver. In both cases, when a change is
detected, the character is redrawn on the virtual text screen.
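The row/column calculation from a trap address that paragraph [0067] refers to is simple arithmetic. Below is a sketch under assumptions that are not in the patent text: mode 03h with 80x25 cells, the buffer based at 0xB8000, display page 0 with an unmodified "display start", and two bytes per cell (character, then attribute). The names are illustrative.

```python
TEXT_BASE = 0xB8000  # start of the colour text buffer (mode 03h)
COLUMNS = 80
ROWS = 25
BYTES_PER_CELL = 2  # character byte followed by attribute byte


def cell_from_address(address: int) -> tuple:
    """Map a physical address inside page 0 to (row, column, which byte)."""
    offset = address - TEXT_BASE
    if not 0 <= offset < COLUMNS * ROWS * BYTES_PER_CELL:
        raise ValueError("address is outside the visible 80x25 page")
    cell, byte_in_cell = divmod(offset, BYTES_PER_CELL)
    row, column = divmod(cell, COLUMNS)
    return row, column, "attribute" if byte_in_cell else "character"


if __name__ == "__main__":
    print(cell_from_address(0xB8000))
    print(cell_from_address(0xB8000 + 160))
```

An emulator that traps each access would still need the written data and the access width, which this address arithmetic alone does not provide.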
This is just one BIOS vendor's patent, and doesn't tell us which way most hardware actually works, or if other vendors do different things. It does essentially confirm that some hardware exists which can trap on stores in that range, though. (Unless that's just a hypothetical possibility that they decided to cover in their patent.)
For the use-case I have in mind, trapping only on screen refresh would be vastly faster than trapping on every store so I'm curious which hardware / firmware works which way.
Motivation for this question
Optimizing an incrementing ASCII decimal counter in video RAM on 7th gen Intel Core - repeatedly storing new digits for an ASCII text counter into the same few bytes of video RAM.
I tested a version of the code in 32-bit user-space under Linux, on WB memory, hoping to approximate the situation with movnti and different ways of getting the CPU to sync its WC buffer to video RAM after each store (or perhaps occasionally in a timer interrupt). But this is not realistic if the real-mode bootloader situation isn't just storing to DRAM, but instead triggering an SMI.
On WB memory, flushing movnti stores with a lock xor byte [esp], 0 is somewhat faster than flushing with clflushopt. But @Eldan reports no speed improvement for those on VGA memory after programming an MTRR to make it WC. (And the same speed as for the original doing normal stores, indicating that by default the VGA framebuffer was UC. Some older BIOSes had an option to make VGA memory WC, which they called USWC = Uncached Speculative Write Combining.)
It's not a real-world problem so I'm not looking for actual workarounds; although it would be interesting to know if manually storing pixel bytes into a VGA graphics mode could be much faster.
Summary
Do any / all real modern systems trigger an SMI on every store to the text-mode framebuffer?
If no, can we approximate a WC store+clflush to the framebuffer, using a movnti + something in user-space on WB memory? So we can easily profile with perf for performance counters.
If different BIOSes and/or hardware use different strategies, what are those strategies? (I don't want details, just a high level like "SMI every vblank to sync the VGA framebuffer to the actual hardware framebuffer")
Would a PCIe or PCI video card with hardware VGA textmode be faster than whatever integrated GPUs actually do? I'm guessing an actual PCIe write transaction would be slower than waiting for a store to hit DRAM, but that a PCIe write would be cheaper than an SMI on every store. A ballpark / order of magnitude comparison would be interesting.
These questions are all highly related, but I can split this up if there isn't as much overlap as I expect.
Do any / all real modern systems trigger an SMI on every store to the text-mode framebuffer?
For video cards, I very much doubt it. Video card manufacturers have had the "get pixel data from char+attribute" logic built into hardware since the 1980s (it predates VGA and hasn't changed much since CGA), and just cut&paste that logic into each newer design without caring much about it.
For things that are not video cards at all (e.g. remote system management tools using LAN) I don't know but suspect not (often they use a special management CPU rather than the main CPU/s so that it works even if the computer is turned "off").
If no, can we approximate a WC store+clflush to the framebuffer, using a movnti + something in user-space on WB memory?
If you're not in user-space, you can change MTRRs (on all CPUs - MTRRs must match and there's a special sequence involved) to make an area of RAM "uncached"; or use PAT in the page tables (much easier than messing with MTRRs, especially if you're using paging anyway, but slightly different behavior due to still needing cache coherency). If you are in user-space then you will have to rely on whatever the OS/kernel provides, and (depending on which OS it is) the OS/kernel may not provide any way to do this at all.
However; even if you find a way to make (an area of) RAM uncached it still won't be very similar, because you'll be writing directly to something attached to a memory controller built into the CPU (that CPU can write to extremely quickly) instead of talking to something at the other end of a PCI link (that will have higher latency and lower bandwidth from CPU's side). Even for integrated video (where it's technically the same RAM chips in the end) writes to VRAM go through a very different path (subject to remapping/GART/paging in the video card, affected by a "write mode" VGA register, affected by bit/plane mask VGA registers, etc).
Would a PCIe or PCI video card with hardware VGA textmode be faster than whatever integrated GPUs actually do?
For writes from CPU to VRAM; typically integrated video is significantly faster than discrete cards (at least for plain writes from CPU to linear frame buffers where none of the VGA's "write logic" is involved).
For extremely rough ballpark estimates; I'd expect a single write to RAM to be around 150 cycles and a single write to PCI to be close to 1000 cycles. For SMI I'd expect a few hundred cycles of latency before SMI arrives at CPU, then the cost of CPU pipeline flush, then about 500 cycles to save CPU's state (and the same for loading state on the return path); then the firmware's code would have to find the cause of the SMI (another few hundred cycles?) before it could know it was a write to VRAM and not something else; then it'd have to examine the saved CPU state and find and decode the instruction that made the write (because it can't know what data was being written, if it was a byte/word/dword write, etc) while taking into account previous CPU state (which mode CPU was in, code size, etc) and keeping track of how emulating the instruction affects the future CPU state (advancing RIP, etc - don't forget that they'll be emulating every instruction that can cause a write, including things like XADD, etc). Next it would have to analyze the state of (emulated) VGA registers (write mode, write mask, plane enable, whatever controls which 64 KiB bank is mapped into the legacy area, font height, ...). Basically; for SMI emulation of a write to text mode frame buffer; I'd expect it to take tens of thousands of cycles before the firmware's code overlooks a minor but important detail buried among a huge amount of complexity, causing it to do the wrong thing and be unusably broken.
Other Notes
I found Phoenix BIOS's patent US20120159520 from 2011, Emulating legacy video using uefi.
I doubt this was ever implemented, because I doubt it can ever work. There are far too many (common and obscure) things you can do with the legacy interfaces (e.g. detect vertical refresh, set up non-standard video modes like "mode X", fiddle with "display start" to implement smooth scrolling and/or page flipping, use "CRTC info" in VBE to alter video timings, etc) that aren't supported by UEFI and can't be done via a third party video driver for UEFI.
Instead, video card manufacturers didn't bother providing UEFI drivers for about 10 years and UEFI firmware used the legacy interface to emulate UEFI services (often breaking secure boot while they were at it); until almost everything was UEFI anyway.
I assume it (SMM) is used for VGA I/O ports for mode-setting.
I assume not. The only thing vaguely related to video that I'd suspect SMM may be used for is controlling the brightness of the screen's backlight in laptops (especially for older laptops, and especially for "lid open/close events") during early boot (before OS takes over).
.. leaving out HW support for text mode seems like something vendors might want to do
I still believe that the (eventual, after the already too long "hybrid BIOS+UEFI" transition phase) removal of 30+ years of accumulated legacy mess (A20, VGA, PS/2, PIT, PIC, ...) from hardware is one of the main reasons hardware manufacturers (Intel) are/have been pushing for UEFI adoption.
Reading through various modern Intel CPU and Platform Controller Hub (PCH) datasheets, it doesn't appear that the necessary hardware is implemented. There doesn't seem to be any way to generate an SMI (System Management Interrupt) in response to processor accesses of the VGA frame buffer (physical addresses 0xA0000 - 0xBFFFF).
The memory controller in the CPU will either route accesses to VGA frame buffer to the integrated graphics controller, the PCI Express port connected directly to the CPU, or the DMI interface connecting the CPU to the PCH. While it's possible to route parts of the VGA frame buffer separately, this appears only meant to support a separate MDA (Monochrome Display Adapter) device. The integrated graphics controller is not well documented so it's possible that it can be configured to generate an SMI on VGA frame buffer accesses, but this seems unlikely. In any case, it wouldn't work with discrete graphics.
Intel PCHs also don't seem to have any support for generating SMIs in response to VGA frame buffer accesses. This would be the most natural place for it, as the PCH already has support for generating SMIs in response to I/O accesses to the keyboard controller, IDE controller and other legacy devices. It's possible that there's some undocumented feature that does this, but it's not included in the lists of possible SMI sources given in the PCH datasheets.
Theoretically, it would be possible for a motherboard manufacturer to connect a fake VGA device to the PCH through a PCI Express port and then generate SMIs using a PCH GPIO pin. However, I'm not sure this will work in practice. By the time the CPU gets the SMI it could have moved on to executing other instructions and it wouldn't be possible to examine the CPU state at the time of the frame buffer access.
(A similar problem happened with SoundBlaster 16 emulation on the SoundBlaster Live. It would generate a PCI SERR# when the legacy SoundBlaster ports were accessed, which would generate an NMI on the CPU. Unfortunately the emulation would break on many Pentium 4 motherboards because the NMI would arrive on the next or subsequent instruction.)

How is PCI segment(domain) related to multiple Host Bridges(or Root Bridges)? [closed]

I'm trying to understand how PCI segment(domain) is related to multiple Host Bridges?
Some people say multiple PCI domains corresponds to multiple Host Bridges, but some say it means multiple Root Bridges under a single Host Bridge. I'm confused and I don't find much useful information in PCI SIG base spec.
I wonder
(1) Suppose I setup 3 PCI domains in MCFG, do I have 3 Host Bridges that connects 3 CPUs and buses, or do I have 3 Root Bridges that support 3x times buses but all share a common Host Bridge in one CPU?
(2) If I have multiple Host Bridges(or Root Bridges), do these bridges share a common South Bridge(e.g., ICH9), or they have separate ones?
I'm a beginner and Google did not help much. I would appreciate it if someone could give me some clues.
The wording used is confusing.
I'll try to fix my terminology with a brief and incomplete summary of the PCI and PCI Express technology.
Skip to the last section to read the answers.
PCI
The Conventional PCI bus (henceforward PCI) is designed around the bus topology: a shared bus is used to connect all the devices.
To create more complex hierarchies some devices can operate as bridges: a bridge connects a PCI bus to another, secondary, bus.
The secondary bus can be another PCI bus (the device is called a PCI-to-PCI bridge, henceforward P2P) or a bus of a different type (e.g. PCI-to-ISA bridge).
This creates a topology of the form:
           _____          _______
----------| P2P |--------| P2ISA |------------- PCI BUS 0
           ‾‾|‾‾          ‾‾‾|‾‾‾
-------------|---------------+----------------- ISA BUS 0
             |
-------------+--------------------------------- PCI BUS 1
Informally, each PCI bus is called a PCI segment.
In the picture above, two segments are shown (PCI BUS 0 and PCI BUS 1).
PCI defined three types of transactions: Memory, IO and configuration.
The first two are assumed to be required knowledge.
The third one is used to access the configuration address space (CAS) of each device; within this CAS it's possible to meta-configure the device.
For example, where it is mapped in the system memory address space.
In order to access the CAS of a device, the devices must be addressable.
Electrically, each PCI slot (either integrated or not), in a PCI bus segment, is wired to create an addressing scheme made of three parts: device (0-31), function (0-7), register (0-255).
Each device can have up to eight logical functions (numbered 0-7), each one with a CAS of 256 bytes.
A bus number is added to the triple above to uniquely identify a device within the whole bus topology (and not only within the bus segment).
This quadruplet is called ID address.
It's important to note that these ID addresses are assigned by the software (but for the device part, which is fixed by the wiring).
They are logical, however, it is advised to number the busses sequentially from the root.
The CPU doesn't generate PCI transactions natively, a Host Bridge is necessary.
It is a bridge (conceptually a Host-to-PCI bridge) that lets the CPU perform PCI transactions.
For example, in the x86 case, any memory write or IO write not reclaimed by other agents (e.g. memory, memory mapped CPU components, legacy devices, etc.) is passed to the PCI bus by the Host Bridge.
To generate CAS transactions, an x86 CPU writes to the IO ports 0xcf8 and 0xcfc (the first contains the ID address, the second the data to read/write).
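As an illustration of the "ID address" written to port 0xcf8, here is a sketch of how the 32-bit value is usually assembled under the legacy configuration mechanism: an enable bit, then bus, device, function and a dword-aligned register offset. It only computes the value; actually issuing the port I/O needs ring 0 or an OS facility. The function name is illustrative.

```python
CONFIG_ADDRESS_PORT = 0xCF8  # receives the value computed below
CONFIG_DATA_PORT = 0xCFC     # data window for the selected register


def config_address(bus: int, device: int, function: int, register: int) -> int:
    """Build the 32-bit CONFIG_ADDRESS value for one configuration register."""
    if not 0 <= bus <= 255:
        raise ValueError("bus must be in 0..255")
    if not 0 <= device <= 31:
        raise ValueError("device must be in 0..31")
    if not 0 <= function <= 7:
        raise ValueError("function must be in 0..7")
    if not 0 <= register <= 255:
        raise ValueError("register must be in 0..255")
    enable = 1 << 31  # bit 31 marks the value as a configuration access
    # The two low register bits are dropped: accesses are dword-aligned.
    return enable | (bus << 16) | (device << 11) | (function << 8) | (register & 0xFC)


if __name__ == "__main__":
    # Vendor/device ID (register 0) of bus 0, device 31, function 0.
    print(hex(config_address(0, 31, 0, 0)))
```

The 8-bit register field is why this mechanism only reaches the first 256 bytes of each function's configuration space.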
A CPU can have more than one Host Bridge; nothing prevents it, though it's very rare.
More likely, a system has more than one CPU with a Host Bridge integrated into each of them, and so has more than one Host Bridge.
For PCI, each Host Bridge establishes a PCI domain: a set of bus segments.
The main characteristic of a PCI domain is that it is isolated from other PCI domains: a transaction is not required to be routable between domains.
An OS can assign the bus numbers of each PCI domain as it pleases: it can reuse the bus numbers or assign them sequentially:
               NON OVERLAPPING                |                 OVERLAPPING
                                              |
 Host-to-PCI           Host-to-PCI            |  Host-to-PCI           Host-to-PCI
   bridge 0              bridge 1             |    bridge 0              bridge 1
                                              |
       |                     |                |        |                     |
       |                     |                |        |                     |
 BUS 0 |               BUS 2 |                |  BUS 0 |               BUS 0 |
       |      |              |      |         |        |      |              |      |
       +------+              +------+         |        +------+              +------+
       |      |              |      |         |        |      |              |      |
       |      |              |      |         |        |      |              |      |
              | BUS 1               | BUS 3   |               | BUS 1               | BUS 1
Unfortunately, the term PCI domain also has a meaning in the Linux kernel, where it is used to number each Host Bridge.
As far as the PCI is concerned this works, but with the introduction of PCI express, this gets confusing because PCI express has its own name for "Host Bridge number" (i.e. PCI segment group) and the term PCI domain denotes the downstream link of the PCI express root port.
PCI Express
The PCI Express bus (henceforward PCIe) is designed around a point-to-point topology: a device is connected only to another device.
To maintain software compatibility, extensive use is made of virtual P2P bridges.
While the basic components of the PCI bus were devices and bridges, the basic components of the PCIe are devices and switches.
From the software perspective, nothing is changed (but for new features added) and the bus is enumerated the same way: with devices and bridges.
The PCIe switch is the basic glue between devices, it has n downstream ports.
Internally the switch has a PCI bus segment, for each port a virtual P2P bridge is created in the internal bus segment (the virtual adjective is there because each P2P only responds to the CAS transaction, that's enough for PCI compatible software).
Each downstream port is a PCIe link.
A PCIe link is regarded as a PCI bus segment; this is consistent with the fact that the switch has a P2P bridge for each downstream port (in total there are 1 + n PCI bus segments for a switch).
A switch has one more port: the upstream port.
It is just like a downstream port but it uses subtractive decoding: as in a network switch, it is used to receive traffic from the "logical external network" and to route traffic for unknown destinations.
So a switch takes 1 + n + 1 PCI bus segments.
Devices are connected directly to a switch.
In the PCI case, a bridge connected the CPU to the PCI subsystem, so it's logical to expect a switch to connect the CPU to the PCIe subsystem.
This is indeed the case, with the PCIe Root Complex (abbreviated PCR below).
The PCR is basically a switch with an important twist: each one of its ports establishes a new PCI domain.
This means that it is not required to route traffic from port 1 to port 2 (while a switch, of course, is).
This creates a shift with the Linux terminology, as mentioned before, because Linux assigns a domain number to each Host Bridge or PCR while, as per specifications, each PCR has multiple domains.
Long story short: same word, different meanings.
The PCIe specification uses the term PCI segment group to define a numbering per PCR (simply put, the PCI segment group is the base address of the extended CAS mechanism of each PCR, so the two map one-to-one).
Due to their isolation property, the ports of the PCR are called PCIe Root Ports.
Note
The term Root Bridge doesn't exist in the specification, I can only find it in the UEFI Root Bridge IO Specification as an umbrella term for both the Host Bridge and PCR (since they share similar duties).
The Host Bridge also goes under the name of Host Adapter.
Finally your question
(1) Suppose I setup 3 PCI domains in MCFG, do I have 3 Host Bridges that connects 3 CPUs and buses, or do I have 3 Root Bridges that support 3x times buses but all share a common Host Bridge in one CPU?
If you have 3 PCI domains you either have 3 Host Bridges or 3 PCIe root ports.
If by PCI domains you meant PCI buses, in the sense of PCI bus segments (irrespective of their isolation), then you can have either a single Host Bridge/PCR handling a topology with 3 busses or more than one Host Bridge/PCR handling a combination of the 3 busses.
There is no specific requirement in this case, as you can see it's possible to cascade busses with bridges.
If you want the busses not to be isolated (so not to be PCI domains) you need a single Host Bridge or a single PCIe root port.
A set of P2P bridges (either real or virtual) will connect the busses together.
(2) If I have multiple Host Bridges(or Root Bridges), do these bridges share a common South Bridge(e.g., ICH9), or they have separate ones?
The bridged platform faded out years ago; we now have a System Agent integrated into the CPU that exposes a set of PCIe lanes (typically 20+) and a Platform Controller Hub (PCH), connected to the CPU with a DMI link.
The PCH can also be integrated into the same socket as the CPU.
The PCH exposes some more lanes that appear to be from the CPU PCR from a software perspective.
Anyway, if you had multiple Host Bridges, they were usually on different sockets but there was typically only a single south bridge for them all.
However, this was not (and is not) strictly mandatory.
The modern Intel C620 PCH can operate in Endpoint Only Mode (EPO) where it is not used as the main PCH (with a firmware and boot responsibilities) but as a set of PCIe endpoints.
The idea is that a Host Bridge just converts CPU transactions to PCI transactions, where these transactions are routed depends on the bus topology and this is by itself a very creative topic.
Where the components of this topology are integrated is another creative task, in the end it's possible to have separate chips dedicated for each Host Bridge or a big single chip shared (or partitioned) among all or even both at the same time!
DOMAIN/SEGMENT in Configuration Space addressing (Domain tends to be the Linux term, Segment is the Windows and PCISIG term) is primarily a PLATFORM level construct. DOMAIN and SEGMENT are used in this answer interchangeably. Logically, SEGMENT is the most significant selector (most significant address bits selector) in the DOMAIN:Bus:Device:Function:Offset addressing scheme of the PCI Family Configuration Space addressing mechanism. (PCIe, PCI-X, PCI, and later follow-on software compatible bus interconnects).
In PCI-Express (PCIe) and earlier specifications, DOMAIN does not appear ON THE BUS (or on the link), or in the link transaction packets. Only the BUS, DEVICE, FUNCTION, OFFSET appear in the transactions or on the bus. However, SEGMENT does have a place in how the LOGICAL software based SEGMENT:Bus,Dev,Func:OFFSET Configuration Space software address is used to actually create an interconnect protocol packet (PCIe) or bus cycle sequence (PCI) that is described as a Configuration Space transaction. In the PCI-Express (PCIe) specification's Enhanced Configuration Access Mechanism (ECAM), this is partially addressed. The remainder of the coverage is handled by the PCI Firmware Specification 3.2 (which covers the newer PCIe Specification software requirements).
In modern PCIe, Configuration Space access is performed by the Operating System through the ECAM mechanism, which abstracts the mechanism of turning a CPU memory space accessing instruction into a Bus/Interconnect Configuration Space transaction. The Operating System software understands SEGMENT/DOMAIN as the highest level (top most) logical selector and address component in the SEGMENT:BUS:DEVICE:FUNC:OFFSET address scheme for the Configuration Space. How the SEGMENT moves from software logical concept to physical hardware instantiation comes in the form of the ECAM translator, or specifically in the existence of multiple ECAM translators in the platform. The ECAM translator translates between a memory type transaction and a configuration space type transaction. The PCIe specification describes how a SINGLE ECAM translator implementation works to translate particular memory address bits in a targeted translation memory write or read, into a Configuration Space write or read. This works as follows:
Address bits:
11:0 Offset bits (allows 4K of configuration space per PCIe function).
14:12 Function selection bits.
19:15 Device selection bits.
27:20 Bus selection bits (n bits used, 1 <= n <= 8; maximum of 8 bits, 256 numbers, can be less).
63:28 Base address of the individual ECAM.
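The bit layout above can be sketched as a small address calculation. This is only an illustration: the function name and the base address 0xE0000000 are made up for the example, and a real platform reports its ECAM base through the MCFG table.

```python
def ecam_address(base: int, bus: int, device: int, function: int, offset: int) -> int:
    """Return the memory address that maps to one configuration register.

    'base' is the ECAM base address (hypothetical value in the example below).
    """
    if not 0 <= bus <= 0xFF:
        raise ValueError("bus must fit in 8 bits")
    if not 0 <= device <= 0x1F:
        raise ValueError("device must fit in 5 bits")
    if not 0 <= function <= 0x7:
        raise ValueError("function must fit in 3 bits")
    if not 0 <= offset <= 0xFFF:
        raise ValueError("offset must fit in 12 bits (4K per function)")
    # Bits 27:20 = bus, 19:15 = device, 14:12 = function, 11:0 = register offset.
    return base + ((bus << 20) | (device << 15) | (function << 12) | offset)


# Bus 1, device 2, function 3, register offset 0x10 (BAR0) in a made-up ECAM.
print(hex(ecam_address(0xE000_0000, bus=1, device=2, function=3, offset=0x10)))
```

Note that nothing in this calculation mentions a SEGMENT: the segment only decides which base address is passed in.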
What is not covered clearly (or at all, really) is that multiple ECAM can exist. A platform can set up multiple ECAM regions. Why would a platform do this? Because the bus selection bit allowance of the ECAM address (8 bits) is restrictive, allowing for only 256 buses in total, which on some systems is insufficient.
The PCI Firmware Specification 3.2 (Jan 26, 2015) describes the ECAM from a software logical perspective. The Firmware Specification describes a software memory structure (present in BIOS reserved regions used by the BIOS and the Operating System to communicate) called the MCFG table. The MCFG table describes the one OR MORE instances of ECAM hardware based configuration space cycle generator present in the platform's hardware implementation. These are regions that translate memory address space transactions (memory reads and writes) into configuration space cycle transactions. In other words, ECAM hardware implementations are the mechanisms that CPUs instantiate to allow the generation of Configuration Space transactions by software. A platform implementation (usually specified and limited by the CPU/Chipset design, but sometimes also decided by the BIOS design choices) allows for some number of ECAM Configuration Space cycle generators. A platform must support at least one, and then it would have a single SEGMENT. But a platform may support multiple ECAM, and then it has a SEGMENT for each ECAM supported. The MCFG table holds one Configuration Space base address allocation structure PER ECAM that is supported on the platform. A single SEGMENT platform will have only a single entry, a multi-SEGMENT platform will have multiple entries, one per ECAM that is supported. Each entry contains the memory base address (of the ECAM region Configuration Space cycle generator), the logical SEGMENT group corresponding to this ECAM, and the sub-range of bus numbers (start and end) that exist in this logical SEGMENT.
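As a rough sketch of what reading such a table looks like, the following parses MCFG entries from raw bytes. It assumes the usual layout (a 36-byte ACPI table header, 8 reserved bytes, then 16-byte entries of base address, segment group, start bus, end bus and 4 reserved bytes). The table built at the bottom is synthetic, and the helper names are invented for the example.

```python
import struct
from collections import namedtuple

ACPI_HEADER_LEN = 36                 # standard ACPI table header
MCFG_RESERVED_LEN = 8                # reserved bytes after the header
ENTRY = struct.Struct("<QHBB4x")     # base, segment, start bus, end bus, reserved

McfgEntry = namedtuple("McfgEntry", "base segment start_bus end_bus")


def parse_mcfg(table: bytes):
    """Return the list of ECAM allocation entries found in a raw MCFG table."""
    if table[:4] != b"MCFG":
        raise ValueError("not an MCFG table")
    (length,) = struct.unpack_from("<I", table, 4)   # total table length
    first = ACPI_HEADER_LEN + MCFG_RESERVED_LEN
    return [
        McfgEntry(*ENTRY.unpack_from(table, off))
        for off in range(first, length - ENTRY.size + 1, ENTRY.size)
    ]


def build_fake_mcfg(entries):
    """Build a synthetic table for the demo (checksum and OEM fields left zero)."""
    body = b"".join(ENTRY.pack(*e) for e in entries)
    length = ACPI_HEADER_LEN + MCFG_RESERVED_LEN + len(body)
    header = b"MCFG" + struct.pack("<I", length) + bytes(ACPI_HEADER_LEN - 8)
    return header + bytes(MCFG_RESERVED_LEN) + body


# Two segments, each with its own (made-up) ECAM base and a full bus range.
fake = build_fake_mcfg([(0xE000_0000, 0, 0, 255), (0x2_0000_0000, 1, 0, 255)])
for entry in parse_mcfg(fake):
    print(f"segment {entry.segment}: base {entry.base:#x}, "
          f"buses {entry.start_bus}-{entry.end_bus}")
```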
A BIOS cannot just decide it wants lots of ECAM and describe multiple ECAM and use regular memory addresses as "base address". The hardware must actually by design instantiate the memory address to Configuration Space cycle generating logic that is anchored to a specific address (sometimes fixed by CPU/chipset design, and sometimes configurable in terms of location by a CPU/chipset specific non-standard location configuration register that the BIOS can program.) In either case, the BIOS describes which ECAM are present, and depending on the design, describes the one way, or the particular manner in which the group of ECAM are configured (or disabled) to describe the active one or more active ECAM on the system. Part of such BIOS configuration include setting up configurable ECAM, choosing how many are enabled if that is configurable, where they will live (at what base addresses), and to configure what chipset devices correspond what ECAM, and to configure which Root Ports correspond and are associated with which ECAM(s). Once this BIOS internal configuration is done, then the BIOS must describe these platform hardware and BIOS decisions to the Operating system. This is done using the PCI Firmware Specification defined MCFG table that is part of the ACPI Specification for BIOS (firmware) to Operating System platform description definition mechanisms.
Using the MCFG table scheme, it is possible to have and describe multiple ECAM, and multiple logical SEGMENT, one per ECAM. It is also possible to have a single SEGMENT which is actually split among multiple ECAM (multiple entries, but all using the same SEGMENT, with non-overlapping bus numbers). But the typical use of the MCFG in a multi-segment configuration is to allow for multiple segments where the bus numbers are duplicated, e.g. each SEGMENT can have up to a full complement of 256 buses, separate from the up to 256 buses that might exist in another SEGMENT.
Three groups of SOFTWARE are aware of SEGMENT in the Configuration Space Addressing. The BIOS (to create the MCFG table), the Operating System (to read the MCFG table, get the ECAM base addresses, and handle logical to physical address translation software tasks by accessing the correct ECAM, at the correct offset), and the last group is ALL OTHER Bus,Device,Function (BDF)-aware software. All software MUST be SEGMENT,BUS,DEV,FUNC aware, not just Bus,DEV,FUNC aware. Software that assume the SEGMENT is always 0, is BROKEN. If you have ever created such software, you should rush back to your desk and fix it NOW, before anyone sees it, and certainly before it is released in a product! :-)
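One way to make application-level software SEGMENT aware is to carry the segment in every address it handles. The sketch below parses the full domain:bus:device.function text form that Linux uses (for example in sysfs directory names) and deliberately rejects the short bus:device.function form rather than guessing segment 0. The class name is invented for the example.

```python
import re
from dataclasses import dataclass

# Full form only: DDDD:BB:DD.F in hexadecimal, e.g. "0001:3a:00.1".
_SBDF = re.compile(r"^([0-9a-fA-F]{4,}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$")


@dataclass(frozen=True)
class PciAddress:
    segment: int
    bus: int
    device: int
    function: int

    @classmethod
    def parse(cls, text: str) -> "PciAddress":
        match = _SBDF.match(text)
        if match is None:
            # Refuse to assume segment 0 when the segment is missing.
            raise ValueError(f"not a full segment:bus:device.function address: {text!r}")
        segment, bus, device, function = (int(g, 16) for g in match.groups())
        if device > 0x1F:
            raise ValueError("device number must fit in 5 bits")
        return cls(segment, bus, device, function)

    def __str__(self) -> str:
        return f"{self.segment:04x}:{self.bus:02x}:{self.device:02x}.{self.function:x}"


a = PciAddress.parse("0000:3a:00.0")
b = PciAddress.parse("0001:3a:00.0")
print(a, b, a == b)   # same bus/device/function, different segment, not equal
```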
Platform designs (in hardware ECAM support, and BIOS design) may implement multiple ECAM. This is usually done to circumvent the 256 total bus restriction that comes with using only a single ECAM. Because the ECAM is defined by the PCISIG, and because that definition revolves around bus number limits ON THE BUS (in transaction fields), a single ECAM cannot implement a Configuration Space Generator for more than 256 buses. However, platforms CAN and DO instantiate multiple ECAM regions, and have an MCFG table describing multiple SEGMENT with multiple ECAM base addresses, and this allows the platform to have more than 256 total buses. (But only a maximum of 256 buses in each PCIe device tree rooted at a PCIe Root Port, and also only a maximum of 256 for all Root Ports together that share a common ECAM Configuration Space generator.) Each ECAM region can describe up to 256 buses in its DOMAIN (or SEGMENT). How the platform system decides to group host root ports (cache coherent domain to PCI/PCIe domain bridges) into SEGMENT is platform specific and arbitrary to the chipset/CPU design and BIOS configuration. Most platforms are fixed and simplistic (often with only a single ECAM), and some are flexible, rigorous, and configurable, allowing for a number of solutions. The type of solution that provides for the maximum amount of PCIe bus numbers to be utilized at the PLATFORM level is to support one ECAM per Root Port. (Few platforms today do this, but ALL SHOULD!) There are two mechanisms used to describe "how" the platform decided to group devices, endpoints, and switches into SEGMENT. The first is the aforementioned MCFG structure, which simply lists the multiple SEGMENTS, their associated ECAM (potentially more than one if the bus ranges in one SEGMENT are split among multiple ECAM), and the base address of each ECAM (or ECAM bus number sub-region).
This method by itself is generally sufficient for many enumeration tasks, as this allows the OS to enumerate all the segments, find their ECAM, and then do a PCI Bus,Device,Function scan of all the devices in each SEGMENT. However, a second mechanism is also available which augments the MCFG information: the _SEG descriptor in the ACPI namespace. The ACPI specification has a generalized platform description mechanism to describe the relation of known devices in the system in a manner that is Operating System independent, but which allows the Operating System to parse the data and digest the platform layout. This mechanism is called the ACPI namespace. Within the namespace, devices are "objects", so PCIe endpoints, PCIe root ports, and PCIe switches that are fixed and included on the ACPI implementing system's motherboard are typically described. In this namespace, objects appear, and have qualifiers or decorators. One such decorator "method" is the _SEG method, which describes which SEGMENT a particular object is located in. In this way, built in devices can be grouped into a particular ECAM access region, or more commonly particular PCIe Root Ports are grouped into SEGMENTS (and associated ECAM access regions, with MCFG described base addresses). Additionally, devices that can be hot-added (and which are not statically present on the motherboard) can describe the SEGMENT they create anew upon hot-plug, or the pre-existing SEGMENT that they join upon hot-plug addition, and the bus numbers in that SEGMENT that they instantiate. This is accomplished in the namespace using the _CBA method decorator on the Root Port objects that are described in such namespaces, and it is used in combination with the _SEG method.
The _CBA applies only to top-level "known" hot-plugable elements, such as if two 4 CPU systems could be dynamically "joined" into a single 8 CPU system, and their respective PCIe root port elements also thus "join" the new single system that expands from the PCIe root ports of the original base 4 CPU system, to include the new additional PCIe root ports of the added 4 CPU's.
For PCIe switches that appear in slots or external expansion chassis, the _SEG (or SEGMENT value) is generally inherited from the most senior Root Port already in the platform, at the top level of that PCIe device tree. The SEGMENT that includes that root port is the segment that all devices below that Root Port belong to.
One often reads in older collateral (before the invention of multiple ECAM, and the concept of SEGMENTs in the Configuration Space and associated Operating System software), that PCIe enumeration is done by scanning through Bus numbers, then Device Numbers, then Function numbers. This is INCORRECT, and outdated. In reality, MODERN (CURRENT) BIOS and Operating System actually scan by stepping through the SEGMENT (selecting the ECAM to use), then the Bus number, then the Device number, and finally the Function number (which applies offsets within the particular ECAM) to generate Configuration Cycles on and below a particular PCIe Root Port (e.g. on a particular PCIe device tree.)
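The scan order described above (SEGMENT first, then Bus, Device, Function) can be sketched as nested loops. This is a simplification under stated assumptions: the config read is a stand-in callback instead of real hardware access, every function 0-7 is probed when function 0 exists (real code checks the multi-function bit in the header type), and real enumerators walk bridges depth-first rather than brute-forcing every bus.

```python
NO_DEVICE = 0xFFFF   # vendor ID read back when nothing responds


def enumerate_functions(segments, read_vendor_id):
    """Yield (segment, bus, device, function) for every function that responds.

    segments: dict mapping segment number -> (start_bus, end_bus)
    read_vendor_id: callback (segment, bus, device, function) -> 16-bit vendor ID
    """
    for segment in sorted(segments):              # selects which ECAM to use
        start_bus, end_bus = segments[segment]
        for bus in range(start_bus, end_bus + 1):
            for device in range(32):
                for function in range(8):
                    vendor = read_vendor_id(segment, bus, device, function)
                    if vendor == NO_DEVICE:
                        if function == 0:
                            break                 # no device in this slot
                        continue
                    yield (segment, bus, device, function)


# Fake platform: the same bus/device/function exists in two different segments.
present = {(0, 0, 0, 0), (0, 0, 3, 0), (0, 0, 3, 2), (1, 0, 0, 0)}


def fake_read(segment, bus, device, function):
    # 0x8086 is used here only as a recognizable non-0xFFFF vendor ID.
    return 0x8086 if (segment, bus, device, function) in present else NO_DEVICE


for sbdf in enumerate_functions({0: (0, 1), 1: (0, 0)}, fake_read):
    print("%04x:%02x:%02x.%x" % sbdf)
```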
Older "PCI" mechanisms (before the Enhanced Configuration Space was defined for PCIe) are unaware of SEGMENT. The older CF8/CFC configuration space access mechanism of PCI, when supported by a hardware platform (which should NO LONGER BE USED BY ANY NEWLY WRITTEN SOFTWARE, ever), generally implements the best practical solution for legacy PCI-only (not PCIe-aware) operating systems: the old mechanism is hard coded to SEGMENT 0. This means that PCI-only-aware systems can only access the devices in SEGMENT 0. All major operating systems have supported the PCIe ECAM enhanced mechanism for over 10 years now, and any use of the CF8/CFC mechanism by software today is considered out of date, archaic and broken, and should be replaced by modern use of ECAM mechanisms, support for the MCFG table and multiple ECAM at a minimum, and if required by the dictates of the Operating System, supplemented by ingestion of the ACPI Specification namespace _SEG and _CBA object information by the Operating System for full dynamic hot-plug situations. Nearly all non-hotplug situations can be handled by MCFG alone if the OS is not using ACPI hotplug methods for other tasks. If the OS uses ACPI hotplug for other device and namespace operations, then support of _SEG and _CBA is usually additionally required, both of the Operating System, and of the BIOS to generate these ACPI namespace objects in the manner that describes the device association grouping to SEGMENT and thus to the specific ECAM hardware that supports Configuration Cycle generation for that device, root port, or root complex bridge or device. Modern software uses SEGMENTs, and only broken and incorrect software assumes that all devices and bridges are present in SEGMENT 0. This is not always true now, and is increasingly untrue as time goes by.
SEGMENT grouping tends to follow some logical rationale in the hardware design and the limitation upon that designs configuration. (But not always, sometimes, it is just arbitrary and weird.) For instance, on Intel 8 socket systems (large multiprocessors), the "native" coherence implementation tends to be limited to 4 sockets in most cases. When the vendor builds an 8 socket system, it is usually done by having two groups of 4 sockets each connected by a cluster switch on the coherent interconnect between them. That type of platform might implement two PCIe SEGMENTS, one per each of the 4 socket clusters. But there is no restriction on how the platform might want to make use of multiple SEGMENT and use multiple ECAM. A two socket system COULD implement multiple ECAM, one SEGMENT per CPU, in order to allow up to 256 PCIe bus per CPU, 512 total. If such a platform instead mapped all of the root ports from both of the two CPU into a single SEGMENT (much more common), then the entire platform can only have 256 bus for the whole platform. An optimally designed platform (and more expensive in terms of ECAM hardware resources) would provide an ECAM for each and every Root Port in the system, and one ECAM for each Root Complex device in the platform. A two CPU system with 8 Root Ports, 4 on each CPU, with 4 Root Complex on each CPU, would implement 16 SEGMENT, 8 of which (the Root Port ones) would each support the maximum of 256 bus, and the 8 support Root Complex device trees would instantiate sufficient bus support to map the Root Complex devices and bridges. Thus such a fully composed two socket system would support a maximum of 8*256 + built in Root Complex device required bus, given bus support on the high side of 2048+ bus. Any "real" server designed today should be designed this way. I'm still waiting to see this modern "real" Intel or AMD server, rather than the play toys being put out these days.
Most operating system software need not concern itself with "how" SEGMENT associations are implemented. It simply needs to allow for the fact that the logical SEGMENT value IS PART OF THE CONFIGURATION SPACE addressing (DO NOT CODE YOUR STUFF FOR just Bus,Device,Function, assuming SEGMENT=0). It MUST BE coded for (SEGMENT, BUS, DEVICE, FUNCTION) unless you want to be labeled a slacker-type, doing it wrong, short-cut-taking imbecile. Support SEGMENT values! They are important now, and will become increasingly important in the future as even more pressure is put on the bus number space (and platforms run out of bus numbers by being limited to the lowly and restrictive 256 buses present in a single SEGMENT). Single SEGMENT restriction happens because of hardware design, but it also happens because software is not properly written and prepared for multi-SEGMENT platforms. Do not be the person that creates bad single-SEGMENT (SEGMENT=0) assuming software. DO NOT BE THAT GUY! Do not write Operating System software this way. Do not write BIOS software this way. Do not write applications that are Bus,Device,Function aware, but are not SEGMENT,BUS,DEVICE,FUNCTION aware.
Platform operating system software (in the kernel in *nix, or in the HAL in Windows) takes the SEGMENT value, and uses that to select WHICH ECAM it will access (e.g. which ECAM base address it will add the BDF/offset memory address offset to). Then it uses the Bus, Device, Function values to index into the higher address bits in that ECAM, and finally it uses the Configuration Space device register offset address to fill in the lower portion of the memory address that is fed into the ECAM Configuration Space transaction generator as its memory address transaction input.
On Intel compatible platforms (Intel & AMD, etc.) the PCI Firmware Specification and the ACPI Specification describe how the BIOS tells the Operating System (Linux, Windows, FreeBSD for example) where the one ECAM (single SEGMENT) or each of the ECAM (multiple SEGMENT) base addresses are located in the memory address space.
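A minimal sketch of that lookup follows. It assumes each MCFG entry is available as a (base, segment, start_bus, end_bus) tuple and that the base address corresponds to bus 0 of the segment (which is how Linux treats it, as far as I know). The entries and addresses are made up.

```python
def config_address(mcfg_entries, segment, bus, device, function, offset):
    """Pick the ECAM for (segment, bus), then build the memory address."""
    for base, entry_segment, start_bus, end_bus in mcfg_entries:
        # Step 1: the SEGMENT (plus bus range) selects which ECAM to use.
        if entry_segment == segment and start_bus <= bus <= end_bus:
            # Step 2: bus/device/function fill the upper bits of the offset,
            # Step 3: the register offset fills the low 12 bits.
            return base + ((bus << 20) | (device << 15) | (function << 12) | offset)
    raise LookupError(f"no ECAM covers segment {segment}, bus {bus}")


# Hypothetical two-segment platform; both segments reuse bus numbers 0-255.
entries = [
    (0xE000_0000, 0, 0, 255),
    (0x2_0000_0000, 1, 0, 255),
]

# Same bus/device/function/offset, different segment: two different addresses.
print(hex(config_address(entries, 0, 5, 0, 0, 0x00)))
print(hex(config_address(entries, 1, 5, 0, 0, 0x00)))
```

Software that hard codes segment 0 would always land in the first ECAM and never see the device behind the second address.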
The SEGMENT never appears on the Bus in PCIe or PCI. In PCIe, the RID (Requester ID) encodes only the Bus# and the Device/Func# of the sender. Likewise in configuration cycles, only the Bus# and Device#/Func# of the destination target is encoded in the downstream transaction. And Device/Func# can get modified treatment in PCIe if ARI (Alternative Routing-ID Interpretation) mode is enabled. But the SEGMENT value does not appear (nor in PCI bus sequences). It is essentially a software construct, but has a real hardware instantiation in the form of the platform hardware support (CPU and Chipset) for multiple ECAM (multi-SEGMENT) or only a single ECAM (single-segment). Note that devices in different SEGMENT can in fact still do peer to peer direct communication. Peer to Peer transactions occur using memory addressing (which is a single global shared space among all SEGMENTS, e.g. all segments still share a single unified memory address space, at least after IOMMU translation and transit through the Root Port, which is a prerequisite for potentially being on different SEGMENT). SEGMENT can have colliding and duplicated bus number spaces, but these describe DIFFERENT buses when they are present in DIFFERENT SEGMENT. A multiple ECAM system can in fact implement a single shared bus number space as well as independent duplicative bus number spaces. In practice this is a rarity, as a system would usually use a single ECAM and a single SEGMENT for a single bus number space. However, some odd hardware might need to make use of a single-segment, multi-ECAM, shared single bus range split across multiple ECAM, providing for a 256 max limited bus space for some odd reason of design (usually hotplug or dynamic configuration related.)
A theoretical platform that had a) LOTS of built in devices in its "chipset/uncore" component, and two CPU sockets in its design COULD implement a four SEGMENT design if it wished. It could put all the built in CPU devices in their own SEGMENT (one per CPU), using two segments, and then map each of the CPU's PCIe root ports into their own SEGMENT, again a unique SEGMENT per CPU, for a total of 4 SEGMENT. This would allow for 4 * 256 = 1024 bus in the whole system.
A different theoretical platform, with the same two socket count, could map all devices from all CPU (in this case just two) and all built in devices, and all the present root ports into a single ECAM, and thus a single SEGMENT. Such a platform would only have the one ECAM, so it would be limited to a total of 256 buses for the entire system (and as a result, would be much more likely to run out of bus numbers if the platform was loaded up with big complicated multi-endpoint add-in cards, with PCIe switches to support the multiple devices present, e.g. a fronted AI supporting GPU card, or if it had multiple fan-out switches to increase its slot count, or if external PCIe switch extenders to outside PCIe enclosures were used and supported properly by that vendor).
The "best" platforms being designed for now and the future will implement an independent ECAM for each and every Root Port, allowing each Root Port to support up to 256 buses in its device tree alone, independent of every other Root Port. To date, I am still waiting for this platform, one that would have lots and lots of PCIe SEGMENTs corresponding to lots of PCIe Root Ports. Such designs are critical for composable I/O solutions, for shared memory solutions, for tiered memory solutions, for composable storage solutions, for large external I/O enclosure solutions, for CXL enabled accelerators, etc. In other words, for modern computing. When a platform comes out and says it has 5% more clock than the last one, I yawn. When one comes out that supports per Root Port ECAM, I will take notice, and that platform will get my gold seal.
The pressure on the bus number space is at the breaking point now, so the use of more segments in the immediate future is likely (or should be if Intel and AMD are paying attention). Technologies like CXL (a PCIe software model compatible bus infrastructure) will only increase this pressure on the Configuration Space bus number limitations of a single SEGMENT (256 bus is not a lot these days.) Every switch uses a bus internally, and every link uses a bus, and thus a large slot count, high fanout system WILL consume more than 256 bus. Multi-SEGMENT designs are here now, and will be increasingly common. FIX YOUR SOFTWARE NOW!
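To put rough numbers on that, here is a back-of-the-envelope count under a deliberately simple model (an assumption, not a rule from any specification): below each Root Port sits one switch, which costs one bus for the link from the Root Port to the switch, one bus inside the switch, and one bus per downstream link. Buses used by the root complex itself are ignored.

```python
BUSES_PER_SEGMENT = 256


def buses_below_root_port(downstream_ports: int) -> int:
    """Bus numbers consumed by one switch hanging off one Root Port."""
    link_to_switch = 1        # Root Port to switch upstream port
    switch_internal = 1       # virtual bus inside the switch
    return link_to_switch + switch_internal + downstream_ports


def segments_needed(root_ports: int, downstream_ports: int):
    """Return (total buses, minimum number of 256-bus segments required)."""
    total = root_ports * buses_below_root_port(downstream_ports)
    segments = -(-total // BUSES_PER_SEGMENT)   # ceiling division
    return total, segments


# Hypothetical box: 8 Root Ports, each fanned out by a 32-port switch.
total, segments = segments_needed(root_ports=8, downstream_ports=32)
print(total, segments)   # 272 buses, which no longer fits in one segment
```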
See:
PCI Express Base Spec 5.0 (or any earlier version)
7.2. PCI Express Enhanced Configuration Access Mechanism (ECAM)
http://www.pcisig.com
PCI Firmware Specification 3.2 (or newer) http://www.pcisig.com
4.1.2 MCFG Table Description
ACPI Specification "MCFG" definition http://uefi.org
_SEG (segment) namespace qualifier in the ACPI namespace description.
UEFI Specification http://uefi.org describes how the OS finds the MCFG table and all other ACPI based tables in modern UEFI platforms with UEFI BIOS (boot firmware).

What is the advantage of using GPIO as IRQ?

I know that we convert the GPIO to an IRQ, but I want to understand what the advantage of doing so is.
If we need an interrupt, why can't we have a dedicated interrupt line in the first place and use it directly as an interrupt?
What is the advantage of using GPIO as IRQ?
If I get your question, you are asking why even bother having a GPIO? The other answers show that someone may not even want the IRQ feature of an interrupt. Typical GPIO controllers can configure an I/O as either an input or an output.
Many GPIO pads have the flexibility to be open drain. With an open drain configuration, you may have a bi-directional 'BUS' and data can be both sent and received. Here you need to change from an input to an output. You can imagine this if you bit-bash I2C communications. This type of use may be fine if the I2C is only used to initialize some other interface at boot.
Even if the interface is not bi-directional, you might wish to capture on each edge. Various peripherals use zero crossing and a timer to decode a signal. For example a laser bar code reader, a magnetic stripe reader, or a bit-bashed UART might look at the time between zero crossings. Is the time double a bit width? Is the line high or low; then shift previous value and add two bits. In these cases you have to look at the signal to see whether the line is high or low. This can happen even if polarity shouldn't matter as short noise pulses can cause confusion.
So even for the case where you have only the input as an interrupt, the current level of the signal is often very useful. If this GPIO interrupt happens to be connected to an Ethernet controller and active high means data is ready, then you don't need to have the 'I/O' feature. However, this case is using the GPIO interrupt feature as glue logic. Often this signalling will be integrated into a dedicated module. The case where you only need the interrupt is typically some custom hardware to detect a signal (case open, power disconnect, etc) which is not industry standard.
The ARM SOC vendor has no idea which case above the OEM might use. The SOC vendor gives lots of flexibility as the transistors on the die are cheap compared to the wire bond/pins on the package. It means that you, who only use the interrupt feature, get economies of scale (and a cheaper part) because others might be using these features and the ARM SOC vendor gets to distribute the NRE cost between more people.
In a perfect world, there is maybe no need for this. Not so long ago when transistors were more expensive, some lines did only behave as interrupts (some M68k CPUs have this). Historically the ARM only has a single interrupt line with one common routine (the Cortex-M are different). So the interrupt source has to be determined by reading another register. As the hardware needs to capture the state of the line on the ARM, it is almost free to add the 'input controller' portion.
Also, for this reason, all of the ARM Linux GPIO drivers have a macro to convert from a GPIO pin to an interrupt number as they are usually one-to-one mapped. There is usually a single 'GIC' interrupt for the GPIO controller. There is a 'GPIO' interrupt controller which forms a tree of interrupt controllers with the GIC as the root. Typically, the GPIO irq numbers are Max GIC IRQ + port *32 + pin; so the GPIO irq numbers are just appended to the 'GIC' irq numbers.
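The numbering formula above can be written out as a tiny sketch. The GIC interrupt count of 160 is a made-up value for illustration, and real kernels resolve this through the GPIO driver rather than a fixed formula, so treat the function names as hypothetical.

```python
PINS_PER_PORT = 32


def gpio_to_irq(max_gic_irq: int, port: int, pin: int) -> int:
    """GPIO irq numbers are appended after the GIC irq numbers."""
    if not 0 <= pin < PINS_PER_PORT:
        raise ValueError("pin must be in 0..31")
    return max_gic_irq + port * PINS_PER_PORT + pin


def irq_to_gpio(max_gic_irq: int, irq: int):
    """Inverse mapping: recover (port, pin) from a GPIO irq number."""
    if irq < max_gic_irq:
        raise ValueError("this irq belongs to the GIC, not to a GPIO pin")
    return divmod(irq - max_gic_irq, PINS_PER_PORT)


MAX_GIC_IRQ = 160                      # hypothetical SoC value
irq = gpio_to_irq(MAX_GIC_IRQ, port=2, pin=5)
print(irq, irq_to_gpio(MAX_GIC_IRQ, irq))
```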
If you were designing a bespoke ASIC for one specific system you could indeed do precisely that - only implement exactly what you need.
However, most processors/SoCs are produced as commodity products, so more flexibility allows them to be integrated in a wider variety of systems (and thus sell more). Given modern silicon processes, chip size tends to be constrained by the physical packaging, so pin count is at an absolute premium. Therefore, allowing pins to double up as either I/O or interrupt sources depending on the needs of the user offers more functionality in a given space, or the same functionality in less space, depending on which way you look at it.
It is not about "converting" anything - on a typical processor or microcontroller, a number of peripherals are connected to an interrupt controller; GPIO is just one of those peripherals. It is also by no means universally true; different devices have different capabilities, but in any case you are simply configuring a GPIO pin to generate an interrupt - that's a normal function of the GPIO not a "conversion".
Prior to ARM Cortex, ARM did not define an interrupt controller, and the core itself had only two interrupt sources (IRQ and FIQ). A vendor defined interrupt controller was required to multiplex the single IRQ over multiple peripherals. ARM Cortex defines an interrupt controller and a more flexible interrupt architecture; it is possible to achieve zero-latency interrupt from a GPIO, so there is no real advantage in having a dedicated interrupt line. A dedicated line might also require the addition of external signal conditioning circuitry that is often incorporated in GPIO on the die.

What happens when we plug a piece of hardware into a computer system?

When we plug a piece of hardware into a computer system, say a NIC (Network Interface Card) or a sound card, what happens under the hood so that we could use that piece of hardware?
I can think of the following 2 scenarios, correct me if I am wrong.
If the hardware has its own memory chips, someone will arrange for a range of address space to map to those memory chips.
If the hardware doesn't have its own memory chips, someone will allocate a range of addresses in the main memory of the computer system to accommodate that hardware.
I am not sure whether the aforementioned someone is the operating system or the CPU.
And another question: Does hardware always need some memory to work?
Am I right on this?
Many thanks.
The world is not that easily defined.
first off look at the hardware and what it does. Take a mouse for example, it is trying to deliver x and y coordinate changes and button status, that can be as little as a few bytes or even a single byte two bits define what the other 6 mean, update x, update y, update buttons, that kind of thing. And the memory requirement is just enough to hold those bytes. Take a serial mouse there is already at least one byte of storage in the serial port so do you need any more? usb, another story just to speak usb back and forth takes memory for the messages, but that memory can be in the usb logic, so do you need any more for such small information.
NICs and sound cards are another category and more interesting. For nics you have packets of data coming and going and you need some buffer space, ring, fifo, etc to allow for multiple packets to be in flight in both directions for efficiency and interrupt latency and the like. You also need registers, these have their storage in the hardware/logic itself and wont need main memory. In both the sound card case and the nic case you can either have memory on the board with the hardware or have it use system memory that it can access semi-directly (dma, etc). Sound cards are similar but different in that you can think of the packets as being fixed sized and continuous. Basically you need to ping-pong buffers to or from the card at some rate, 44.1 kHz 16 bit per sample stereo is 44100 * 2 * 2 = 176400 bytes per second, say for example the driver/software is preparing the next 8192 bytes at a time and while the hardware is playing the pong buffer software is filling the ping buffer, when hardware drains the pong buffer it indicates this to the software, starts draining the ping buffer and the software fills the pong buffer.
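To put the audio numbers in perspective, here is the same arithmetic as a short sketch, plus how long one 8192 byte buffer lasts at that rate (which is the deadline the software has for refilling the other buffer). The function names are just for illustration.

```python
def bytes_per_second(sample_rate_hz: int, bytes_per_sample: int, channels: int) -> int:
    """Raw PCM data rate."""
    return sample_rate_hz * bytes_per_sample * channels


def buffer_duration_ms(buffer_bytes: int, rate_bytes_per_second: int) -> float:
    """How long the hardware takes to drain one buffer."""
    return 1000.0 * buffer_bytes / rate_bytes_per_second


rate = bytes_per_second(44100, bytes_per_sample=2, channels=2)   # 16 bit stereo
print(rate)                                       # 176400 bytes per second
print(round(buffer_duration_ms(8192, rate), 1))   # about 46.4 ms per buffer
```

So with two 8192 byte buffers the driver has roughly 46 ms to fill one buffer while the hardware drains the other.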
All interesting stuff but to get to the point. With the nic or sound card you could have as little as two registers, an address/command register and a data register. Quite painful but was often used in the old days in restricted systems, still used as well. Or you could go to the other extreme and desire to have all of the memory on the device mapped into system memory's address space as well as each register having its own unique address. With audio you dont really need random access to the memory so you dont really need this, graphics you do, nic cards you could argue do you leave the packet on the nic or do you make a copy in system memory where you can have a much larger software buffer/ring freeing the hardwares limited buffer/ring. If on nic then you would want random access, if not then you dont.
For isa/pci/pcie, etc on x86 systems the hardware is usually mapped directly into the processors memory space. So for 32 bit systems you can address up to 4GB, well even if you have 4GB worth of memory some of that memory you cannot get to because video cards, hardware registers, PCI, etc consume some of that address space (registers or memory or both, whatever the hardware was designed to use). As distasteful as it may appear today this is why there was a distinction between I/O mapped I/O and memory mapped I/O on x86 systems, its another address bit if you will. You could have all of your registers in I/O space and not lose memory space, and map memory into nice neat aligned chunks, requiring less of your ram to be replaced with hardware. Either way, isa had basically vendor specific ways of mapping into the memory space available to the isa bus, jumpers, interesting detection schemes with programmable address decoders, etc. PCI and its successors came up with something more standard. When the computer boots (talking x86 machines in general now) the BIOS goes out on the pcie bus and looks to see who is out there by talking to config space that is mapped per card in a known place. Using a known protocol the cards indicate the desired amount of memory they require, the BIOS then allocates out of the flat memory space for the processor chunks of memory for each device and tells the device what address and how much it has been allocated. It is certainly possible for the operating system to re-do or override this but typically the BIOS does this discovery for the system and the operating system simply reads the config space on each device which includes the vendor id and device id and then knows how and where to talk to the device. For this memory space I believe the hardware contains the memory/registers.
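The "known protocol" for asking a card how much address space it wants is, for PCI base address registers (BARs), to write all ones to the BAR and read it back: the device hardwires the low address bits to zero, so the readback value encodes the size. The sketch below only decodes such a readback value for a 32 bit BAR. It does not touch hardware, and the function name is invented for the example.

```python
def bar_size_from_readback(readback: int) -> int:
    """Decode the size a 32 bit BAR requests from its all-ones readback value."""
    readback &= 0xFFFFFFFF
    if readback == 0:
        return 0                          # BAR not implemented
    if readback & 0x1:
        address_bits = readback & 0xFFFFFFFC   # I/O BAR: low 2 bits are flags
    else:
        address_bits = readback & 0xFFFFFFF0   # memory BAR: low 4 bits are flags
    # The writable bits are the ones that stayed set, so invert and add one.
    return (~address_bits + 1) & 0xFFFFFFFF


print(hex(bar_size_from_readback(0xFFF00000)))   # 1 MiB memory BAR
print(bar_size_from_readback(0xFFFFF008))        # 4096 byte prefetchable memory BAR
print(bar_size_from_readback(0xFFFFFF01))        # 256 byte I/O BAR
```

The firmware (or OS) then picks a naturally aligned address of that size and writes it back into the BAR.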
For general system memory to dma to/from I believe the operating system and device drivers have to provide the mechanism for allocating that system memory then telling the hardware what address to dma to/from.
The x86 way of doing it with the bios handling the ugly details and having system memory address space and pci address space being the same address space has its pros and cons. A pro is that the hardware can easily dma to/from system memory because it does not have to know how to get from pcie address space to system address space. The negative is the case of a 32 bit system where pcie normally consumes up to 1GB of address space and the dram you bought for that hole is not available. The transition from 32 bit to 64 bit is slow and painful, the bioses and pcie chips are still limiting to the lower 4gig and limiting to 1gb for all the pcie devices, even if the chipset has a 64 bit mode, and this is with 64 bit processors and more than 4gb of ram. the mmu allowes for fragmented memory so that is not an issue. Slowly the chipsets and bioses are catching up but it is taking time.
USB: these are serial, mostly master/slave protocols. Like a serial port but bigger, faster and more complicated, and like a serial port both the master and slave hardware need RAM to store the messages, very much like a NIC. Like a NIC, in theory, the interface can be register based, with the memory pulled sequentially, or it can be mapped into system memory with random access, etc. Think of it this way: the USB interface can and does sit on a PCIe interface even if it is on the motherboard. A number of devices on your motherboard are PCIe devices even though they are not a card in an actual PCIe connector. They fall into the PCIe category in terms of how you might design your interface and who has what memory where.
Some devices, like video cards, have lots of memory on board, more than is practical, or at least painless, to map into PCIe memory space all at once. These would want to use a sliding-window arrangement. You tell the video card you want to look at address 0x0000 in the video card's address space, but your window may be only 0x1000 bytes (for example) in system/PCIe space. When you want to look at addresses 0x1000 to 0x1FFF in video memory space, you write some register to move the window, and then the same PCIe memory space accesses different memory on the video card.
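A toy model of that sliding window, in Python, may help. Everything here is hypothetical (the class, the "window base" register, the 0x1000-byte window); real cards each define their own registers, so treat it as an illustration of the idea and not of any actual device.

```python
class WindowedCard:
    """Toy card whose on-board memory is visible through a small window."""

    WINDOW_SIZE = 0x1000  # bytes visible to the host at any one time

    def __init__(self, vram_size: int) -> None:
        self.vram = bytearray(vram_size)  # memory on the card itself
        self.window_base = 0              # hypothetical window register

    def set_window(self, base: int) -> None:
        """Move the window; `base` must be aligned and inside the card."""
        if base % self.WINDOW_SIZE != 0:
            raise ValueError("window base must be window-aligned")
        if not 0 <= base <= len(self.vram) - self.WINDOW_SIZE:
            raise ValueError("window base outside card memory")
        self.window_base = base

    def _check(self, offset: int) -> None:
        if not 0 <= offset < self.WINDOW_SIZE:
            raise IndexError("offset outside the window")

    def read(self, offset: int) -> int:
        self._check(offset)
        return self.vram[self.window_base + offset]

    def write(self, offset: int, value: int) -> None:
        self._check(offset)
        self.vram[self.window_base + offset] = value


card = WindowedCard(vram_size=0x4000)
card.write(0x10, 0xAA)    # lands at card address 0x0010
card.set_window(0x1000)   # slide the window
card.write(0x10, 0xBB)    # same host offset, card address 0x1010
print(hex(card.vram[0x0010]), hex(card.vram[0x1010]))
```

The host-side offset 0x10 never changes; only the register write decides which part of the card's memory it refers to.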
x86, being the dominant architecture, has this overlapped PCIe and system memory addressing, but that is not how the whole world works. Other solutions include having independent system and PCIe address spaces with sliding windows, as in the video card case above, allowing you to have, say, a 2GB video card mapped flat in PCIe space while limiting the window into PCIe space to something not painful for the host system.
Hardware designs are as varied as software designs. Take 100 software engineers and give them a specification, and you may get as many as 100 different solutions. Same with hardware: give them a specification and you may get 100 different PCIe designs. Some standards are in place to limit that, as does cloning: when you want to make a Sound Blaster compatible card, you don't change the interface. But given the same freedom software has, the hardware can and will vary, and with the number of types of PCIe devices (sound, hard disk controllers, video, USB, networking, etc.) you will get that many different mixes of registers and addressable memory.
Sorry for the long answer, hope this helps. I would dig through Linux and/or BSD sources for device drivers, along with programmer's reference manuals if you can get access to them, and see how different hardware designs use register and memory space, which designs are painful for the software folks, and which designs are elegant and well done.
The answer depends on the hardware's interface: is it over USB or PCI-Express? (There could be other connectivity methods too; USB and PCI-Express are the most common.)
With USB
The host learns about the newly arrived device by reading its descriptors and loads the appropriate device driver. The device presents an ID that is used for Plug and Play, and the host assigns the device an address. Once the device driver kicks in, it configures the device and makes it ready for data transfer. The data transfer is done using I/O Request Packets (IRPs); the transfer technique and how the IRPs are loaded depend on whether the transfer is isochronous, bulk, or one of the other modes.
So to answer your second question: yes, the hardware needs some memory to work. The device driver and the USB Host Controller Driver together set up the memory on the host for the USB device; the USB device driver then communicates with and drives the device accordingly.
With PCI-Express
It is similar, but sorry, I do not have hands-on experience with PCI-Express.

Resources