AXI4 (Lite) Narrow Burst vs. Unaligned Burst Clarification/Compatibility - vhdl

I'm currently writing an AXI4 master that is supposed to support AXI4 Lite (AXI4L) as well.
My AXI4 master is receiving data from a 16-bit interface. This is on a Xilinx Spartan 6 FPGA and I plan on using the EDK AXI4 Interconnect IP, which has a minimum WDATA width of 32 bits.
At first I wanted to use narrow burst, i.e. AWSIZE = x"01" (2 bytes in transfer). However, I found that Xilinx' AXI Reference Guide UG761 states "narrow bursts [are] supported but [...] not recommended." Unaligned transactions are supposed to be supported.
This had me thinking. Say I start an unaligned burst:
AWLEN = x"01" (2 beats)
AWSIZE = x"02" (4 bytes in transfer")
And do the following:
AX (32-bit word #0: send hi16)
XB (32-bit word #1: send lo16)
Where A, B are my 16 bit words that start off at an unaligned (2-byte aligned) address. X means WSTRB is deasserted for the indicated 16 bit.
Is this supported or does this fall under the category "narrow burst" even through AWSIZE = x"02" (4 bytes in transfer) as opposed to AWSIZE = x"01" (2 bytes in transfer)?
Now, if this was just for AXI4, I would probably not care as much about this use case, because AXI4 peripherals are required to use the WSTRB signals. However, the AXI Reference Guide UG761 states "[AXI4L] Slaves interface can elect to ignore WSTRB (assume all bytes valid)."
I read here that many (but not all; and there is not list?) Xilinx AXI4L peripherals do elect to ignore WSTRB.
Does this mean that I'm essentially barred from doing narrow burst ("not recommended") as well as unaligned bursts ("WSTRB can be ignored") or is there an easy way to unload some of the implementation work from my master into the interconnect, guaranteeing proper system behavior when accessing AXI4L peripherals?

Your example is not a narrow burst, and should work.
The reason narrow burst is not recommended is that it gives sub-optimal performances. Both narrow-burst and data realignement cost in area and are not recommended IMHO. However, DRE has minimal bandwidth cost, while narrow burst does. If your AXI port is 100MHz 32 bits, you have 3.2GBits maximum throughput, if you use narrow burst of 16 bits 50% of the time, than your maximum throughput is reduced to 2.4GBits (32bits X 50MHz + 16bits X 50Mhz). Also, I'm not sure AXI-Lite support narrow burst or data realignement.
Your example has 2 major flaws. First, it requires 3 data-beats to transfer 32 bits, which is worst than narrow-burst (I don't think AXI is smart enough to cancel the last burst with WSTRB to 0). Second, you can't burst more than 2 16-bits at a time, which will hang your AXI infrastructure's performances if you have a lot of data to transfer.
The best way to deal with this is concatenate the 16 bits together to form a 32 bits in your block. Then you buffer these 32 bits and burst them when you have enough. This is the AXI high performance way to do this.
However, if you receive data as 16-bits, it seems you would be better using AXI-Stream, which support 16-bits but doesn't have the notion of addresses. You can map an AXI-Stream to AXI-4 using Xilinx's IP cores. Either AXI-Datamover or AXI-DMA can do that. Both do the same (in fact, AXI-DMA includes a datamover), but AXI-DMA is controlled trough an AXI-Lite interface while Datamover is controlled through additionals AXI-Streams.
As a final note, the Xilinx cores never requires narrow-burst or DRE. If you need DRE in AXI-DMA, it's done by the AXI-DMA core and not the AXI Interconnect. Also, these cores are clear-source, so you can checkout how they operate easily.

Related

How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

I am on the hook to analyze some "timing channels" of some x86 binary code. I am posting one question to comprehend the bsf/bsr opcodes.
So high-levelly, these two opcodes can be modeled as a "loop", which counts the leading and trailing zeros of a given operand. The x86 manual has a good formalization of these opcodes, something like the following:
IF SRC = 0
THEN
ZF ← 1;
DEST is undefined;
ELSE
ZF ← 0;
temp ← OperandSize – 1;
WHILE Bit(SRC, temp) = 0
DO
temp ← temp - 1;
OD;
DEST ← temp;
FI;
But to my suprise, bsf/bsr instructions seem to have fixed cpu cycles. According to some documents I found here: https://gmplib.org/~tege/x86-timing.pdf, seems that they always take 8 CPU cycles to finish.
So here are my questions:
I am confirming that these instructions have fixed cpu cycles. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind. I cannot find corresponding specifications in Intel's official documents.
Then why it is possible? Apparently this is a "loop" or somewhat, at least high-levelly. What is the design decision behind? Easier for CPU pipelines?
BSF/BSR performance is not data dependent on any modern CPUs. See https://agner.org/optimize/, https://uops.info/, or http://instlatx64.atw.hu/ for experimental timing results, as well as the https://gmplib.org/~tege/x86-timing.pdf you found.
On modern Intel, they decode to 1 uop with 3 cycle latency and 1/clock throughput, running only on port 1. Ryzen also runs them with 3c latency for BSF, 4c latency for BSR, but multiple uops. Earlier AMD is sometimes even slower.
(Prefer rep bsf aka tzcnt in code that might run on AMD CPUs, if you don't need the FLAGS difference between bsf and tzcnt for zero inputs. lzcnt and tzcnt are fast on AMD as well, like 1 cycle latency with 3/clock throughput for lzcnt on Zen 2 (https://uops.info/). Unfortunately lzcnt and bsr aren't compatible that way, so you can't use it in an "optimistic" forward-compatible way, you have to know which you're getting.)
Your "8 cycle" (latency and throughput) cost appears to be for 32-bit BSF on AMD K8, from Granlund's table that you linked. Agner Fog's table agrees, (and shows it decodes to 21 uops instead of having a dedicated bit-scan execution unit. But the microcoded implementation is presumably still branchless and not data-dependent). No clue why you picked that number; K8 doesn't have SMT / Hyperthreading so the opportunity for an ALU-timing side channel is much reduced.
Do note that they have an output dependency on the destination register, which they leave unmodified if the input was zero. AMD documents this behaviour, Intel implements it in hardware but documents it as an "undefined" result, so unfortunately compilers won't take advantage of it and human programmers maybe should be cautious. IDK if some ancient 32-bit only CPU had different behaviour, or if Intel is planning to ever change (doubtful!), but I wish Intel would document the behaviour at least for 64-bit mode (which excludes any older CPUs).
lzcnt/tzcnt and popcnt on Intel CPUs (but not AMD) have the same output dependency before Skylake and before Cannon Lake (respectively), even though architecturally the result is well-defined for all inputs. They all use the same execution unit. (How is POPCNT implemented in hardware?). AMD Bulldozer/Ryzen builds their bit-scan execution unit without the output dependency baked in, so BSF/BSR are slower than LZCNT/TZCNT (multiple uops to handle the input=0 case, and probably also setting ZF according to the input, not the result).
(Taking advantage of that with intrinsics isn't possible; not even with MSVC's _BitScanReverse64 which uses a by-reference output arg that you could set first. MSVC doesn't respect the previous value and assumes it's output-only. VS: unexpected optimization behavior with _BitScanReverse64 intrinsic)
The pseudocode in the manual is not the implementation
(i.e. it's not necessarily how hardware or microcode works).
It gives precisely the same result in all cases, so you can use it to understand exactly what will happen for any corner cases the text leaves you wondering about. That is all.
The point is to be simple and easy to understand, and that means modeling things in terms of simple 2-input operations which happen serially. C / Fortran / typical pseudocode doesn't have operators for many-input AND, OR, or XOR, but you can build that in hardware up to a point (limited by fan-in, the opposite of fan-out).
Integer addition can be modelled as bit-serial ripple carry, but that's not how it's implemented! Instead, we get single-cycle latency for 64-bit addition with far fewer than 64 gate delays using tricks like carry lookahead adders.
The actual implementation techniques used in Intel's bit-scan / popcnt execution unit are described in US Patent US8214414 B2.
Abstract
A merged datapath for PopCount and BitScan is described. A hardware
circuit includes a compressor tree utilized for a PopCount function,
which is reused by a BitScan function (e.g., bit scan forward (BSF) or
bit scan reverse (BSR)).
Selector logic enables the compressor tree to
operate on an input word for the PopCount or BitScan operation, based
on a microprocessor instruction. The input word is encoded if a
BitScan operation is selected.
The compressor tree receives the input
word, operates on the bits as though all bits have same level of
significance (e.g., for an N-bit input word, the input word is treated
as N one-bit inputs). The result of the compressor tree circuit is a
binary value representing a number related to the operation performed
(the number of set bits for PopCount, or the bit position of the first
set bit encountered by scanning the input word).
It's fairly safe to assume that Intel's actual silicon works similarly to this. Other Intel patents for things like out-of-order machinery (ROB, RS) do tend to match up with performance experiments we can perform.
AMD may do something different, but regardless we know from performance experiments that it's not data-dependent.
It's well known that fixed latency is a hugely beneficial thing for out-of-order scheduling, so it's very surprising when instructions don't have fixed latency. Sandybridge even went so far as to standardize uop latencies to simplify the scheduler and reduce the opportunities write-back conflicts. (e.g. a 3-cycle latency uop followed by a 2-cycle latency uop to the same port would produce 2 results in the same cycle). This meant making complex-LEA (with all 3 components: [disp + base + idx*scale]) take 3 cycles instead of just 2 for the 2 additions like on previous CPUs. There are no 2-cycle latency uops on Sandybridge-family. (There are some 2-cycle latency instructions, because they decode to 2 uops with 1c latency each. The scheduler schedules uops, not instructions).
One of the few exceptions to the rule of fixed latency for ALU uops is division / sqrt, which uses a not-fully-pipelined execution unit. Division is inherently iterative, unlike multiplication where you can make wide hardware that does the partial products and partial additions in parallel.
On Intel CPUs, variable-latency for L1d cache access can produce replays of dependent uops if the data wasn't ready when the scheduler optimistically hoped it would be.
Is there a penalty when base+offset is in a different page than the base?
Why does the number of uops per iteration increase with the stride of streaming loads?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
The 80x86 manual has a good description of the expected behavior, but that has nothing to do with how it's actually implemented in silicon in any model from any manufacturer.
Let's say that there's been 50 different CPU designs from Intel, 25 CPU designs from AMD, then 25 more from other manufacturers (VIA, Cyrix, SiS/Vortex, NSC, ...). Out of those 100 different CPU designs, maybe there's 20 completely different ways that BSF has been implemented, and maybe 10 of them have fixed timing, 5 have timing that depends on every bit of the source operand, and 5 depend on groups of bits of the source operand (e.g. maybe like "if highest 32 bits of 64-bit operand are zeros { switch to 32-bit logic that's 2 cycles faster }").
I am confirming that these instructions have fixed cpu cycles. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind. I cannot find corresponding specifications in Intel's official documents.
You can't. More specifically, you can test or research existing CPUs, but that's a waste of time because next week Intel (or AMD or VIA or someone else) can release a new CPU that has completely different timing.
As soon as you rely on "measured from existing CPUs" you're doing it wrong. You have to rely on "architectural guarantees" that apply to all future CPUs. There is no "architectural guarantee". You have to assume that there may be a timing side-channel (even if there isn't for current CPUs)
Then why it is possible? Apparently this is a "loop" or somewhat, at least high-levelly. What is the design decision behind? Easier for CPU pipelines?
Instead of doing a 64-bit BSF, why not split it into a pair of 32-bit pieces and do them in parallel, then merge the results? Why not split it into eight 8-bit pieces? Why not use a table lookup for each 8-bit piece?
The answers posted have explained well that the implementation is different from pseudocode. But if you are still curious why the latency is fixed and not data dependent or uses any loops for that matter, you need to see electronic side of things.
One way you could implement this feature in hardware is by using a Priority encoder.
A priority encoder will accept n input lines that can be one or off (0 or 1) and give out the index of the highest priority line that is on. Below is a table from the linked Wikipedia article modified for a most significant set bit function.
input | output index of first set bit
0000 | xx undefined
0001 | 00 0
001x | 01 1
01xx | 10 2
1xxx | 11 3
x denotes the bit value does not matter and can be anything
If you see the circuit diagram on the article, there are no loops of any kind, it is all parallel.

Store forwarding Address vs Data: What the difference between STD and STA in the Intel Optimization guide?

I'm wondering if any Intel experts out there can tell me the difference between STD and STA with respect to the Intel Skylake core.
In the Intel optimization guide, there's a picture describing the "super-scalar ports" of the Intel Cores.
Here's the PDF. The picture is on page 40.
.
Here's another picture from page 78, this picture describes "Store Address" and "Store Data":
Prepares the store forwarding and store retirement logic with the address of the data being stored.
Prepares the store forwarding and store retirement logic with the data being stored.
Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle, I was curious what the difference was between these two.
It seems "natural" to me that store-forwarding would be done to the address of the data. But I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done. Are there any assembly / optimization experts out there that can help me understand exactly the difference between STD and STA is?
Intel CPUs have been splitting stores into store-address and store-data since the first P6-family microarchitecture, Pentium Pro.
But store-address and store-data uops can micro-fuse into one fused-domain uop. On Sandy/IvyBridge, indexed addressing modes are un-laminated as described in Intel's optimization manual. But Haswell and later can keep them micro-fused even in the ROB, so they aren't un-laminated. See Micro fusion and addressing modes. (Intel doesn't mention this, and Agner Fog hasn't had time to test extensively for Haswell/Skylake so his usually-good microarch PDF doesn't even mention un-lamination at all. But you should still definitely read it to learn more about how uops work and how instructions are decoded and go through the pipeline. See also other x86 performance links in the x86 tag wiki)
Considering that Skylake can perform #1 3x per clock cycle, but can only perform #2 once per clock cycle
Ports 2 and 3 can also run load uops on their AGUs, leaving the load-data part of the port unused that cycle. Port7 only has a dedicated store-AGU for simple addressing modes.
Store addressing modes with an index register can't use port 7, only p2/p3. But if you do use "simple" addressing modes for stores, the peak throughput is 2 loads + 1 store per clock.
On Nehalem and earlier (P6 family), p2 was the only load port, p3 was the store-address port, and p4 was store-data.
On IvyBridge/Sandybridge, there weren't separate ports for store-address uops, they always just ran on the AGU (Address Generation Unit) in the load ports (p23). With 256b loads / stores, the AGU was only needed every other cycle (256b load or store uops occupy the load or store-data ports for 2 cycles, but the load ports can accept a store-address uop during that 2nd cycle). So 2 load / 1 store per clock was in theory sustainable on Sandybridge, but only if most of it was with AVX 256-bit vector loads / stores running as two 128-bit halves.
Haswell added the dedicated store-AGU on port7 and widened the load/store execution units to 256b, because there aren't spare cycles when the load ports don't need their AGUs if there's a steady supply of loads.
A store-address uop writes the address (and width, I guess) into the store buffer (aka Memory Order Buffer in Intel's terminology). Having this happen separately, and possibly before the data to be stored is even ready lets later loads (in program order) detect whether they overlap the store or not.
Out-of-order execution of loads when there are pending stores with unknown address is problematic: a wrong guess means having to roll back the pipeline. (I think the machine_clears.memory_ordering perf counter event includes this. It is possible to get non-zero counts for this from single-threaded code, but I forget if I had definite evidence that Skylake sometimes speculatively guesses that loads don't overlap unknown-address stores).
As David Kanter points out in his Haswell microarch writeup, a load uop also needs to probe the store buffer to check for forwarding / conflicts, so an execution unit that only runs store-address uops is cheaper to build.
Anyway, I'm not sure what the performance implications would be if Intel redesigned things so port7 had a full AGU that could handle indexed addressing modes, too, and made store-address uops only run on p7, not p2/p3.
That would stop store-address uops from "stealing" p23, which does happen and which reduces max sustained L1D bandwidth from 96 bytes / cycle (2 load + 1 store of 32-byte YMM vectors) down to ~81 bytes / cycle for Skylake according to a table in Intel's optimization manual. But under the right circumstances, Skylake can sustain 2 loads + 1 store per clock of 4-byte operands, so maybe that 81-byte / cycle number is limited by some other microarchitectural limit. The peak is 96B/clock, but apparently that can't happen back-to-back indefinitely.
One downside to stopping store-address uops from running on p23 is that it would take longer for store addresses to be known, maybe delaying loads more.
I can't understand when store-forwarding on the data (aka: STD / Port 4) would ever be done.
A store/reload can have the load take the data from the store buffer, instead of waiting for it to commit to L1D and reading it from there.
How does store to load forwarding happens in case of unaligned memory access?
Store-to-Load Forwarding and Memory Disambiguation in x86 Processors
Store/reload can happen when a function spills some registers before calling a function, of as part of passing args on the stack (especially with crappy stack-args calling conventions that pass all args on the stack). Or passing something by reference to a non-inline function. Or in a histogram, if the same bin is hit repeatedly, you're basically doing a memory-destination increment in a loop.
Its been a few days without a response, so here's my best guess at "answering my own question".
The raw x86 instruction set isn't executed directly by modern processors. Instead, the x86 instruction set is "compiled" down into Micro-ops (uOps) before being executed by the Intel core. This shouldn't be too surprising, because some x86 instructions can be complex. An example taken from the optimization guide is as follows:
Similarly, the following store instruction has three register sources and is broken into "generate store
address" and "generate store data" sub-components.
MOV [ESP+ECX*4+12345678], AL
This is currently found on page 50 of the optimization manual (2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD)).
In this case, the address of the store operation is complex, so it is its own uOp. So at very least, this singular x86 instruction gets converted into two uOps internally. The names of these two uOps are "Store Address" and "Store Data". The manual doesn't describe the internal uOps at all, so it may take even more than two uOps to accomplish.
Since there's only one "store data" port on Skylake systems, that means that Skylake can only modify at most one memory location per cycle. The three "Store Address" ports means that Skylake can calculate the effective address of many instructions simultaneously (possibly because some very complicated addresses may take more than one uOp to execute??).

micro-programmed control circuit and one questions

I ran into a question:
in digital system with micro-programmed control circuit, total of distinct operation pattern of 32 signal is 450. if the micro-programmed memory contains 1K micro instruction, by using Nano memory, how many bits is reduced from micro-programmed memory?
1) 22 Kbits
2) 23 Kbits
3) 450 Kbits
4) 450*32 Kbits
I read in my notes, that (1) is true, but i couldn't understand how we get this?
Edit: Micro instructions are stored in the micro memory (control memory). There is a chance that a group of micro instructions may occur several times in a micro program. As a result the more memory space isneeded.By making use of the nano memory we can have significant saving in the memory when a group of micro operations occur several times in a micro program. Please see for nano technique ref:
Control Units
Back in the day, before .NET, when you actually had to know what a computer was, before you could make it do stuff. This question would have gotten a ton of answers.
Except, back then, the internet wasn't really a thing, and Stack overflow was not really a problem, as the concept of a stack and a heap, wasn't really a standard..
So just to make sure that we are in fact talking about the same thing, I will just tr to explain this..
The control unit in a digital computer initiates sequences of microoperations. In a bus-oriented system, the control signals that specify microoperations are
groups of bits that select the paths in multiplexers, decoders, and ALUs.
So we are looking at the control unit, and the instruction set for making it capable of actually doing stuff.
We are dealing with what steps should happen, when the compiled assembly requests a bit shift, clear a register, or similar "low level" stuff.
Some of theese instructions may be hardwired, but usually not all of them.
Micro-programs
Quote: "Microprogramming is an orderly method of designing the control unit
of a conventional computer"
(http://www2.informatik.hu-berlin.de/rok/ca/data/slides/english/ca9.pdf)
The control variables, for the control unit can be represented by a string of 1’s and 0’s called a "control word". A microprogrammed control unit is a control unit whose binary control variables are not hardwired, but are stored in a memory. Before we optimized stuff we called this memory the micro memory ;)
Typically we would actually be looking at two "memories" a control memory, and a main memory.
the control memory is for the microprogram,
and the main memory is for instructions and data
The process of code generation for the control memory is called
microprogramming.
... ok?
Transfer of information among registers in the processor is through MUXs rather
than a bus, we typically have a few register, some of which are familiar to programmers, some are not. The ones that should ring a bell for most in here, is the processor registers. The most common 4 Processor registers are:
Program counter – PC
Address register – AR
Data register – DR
Accumulator register - AC
Examples where microcode uses processor registers to do stuff
Assembly instruction "ADD"
pseudo micro code: " AC ← AC + M[EA] " where M[EA] is data from main memory register
control word: 0000
Assembly instruction "BRANCH"
pseudo micro code "If (AC < 0) then (PC ← EA) "
control word: 0001
Micro-memory
The micro memory only concerns how we organize whats in the control memory.
However when we have big instruction sets, we can do better than simply storing all the instructions. We can subdivide the control memory into "control memory" and "nano memory" (since nano is smaller than micro right ;) )
This is good as we don't waste a lot of valuable space (chip area) on microcode.
The concept of nano memory is derived from a combination of vertical and horizontal instructions, but also provides trade-offs between them.
The motorola M68k microcomputer is one the earlier and popular µComputers with this nano memory control design. Here it was shown that a significant saving of memory could be achieved when a group of micro instructions occur often in a microprogram.
Here it was shown that by structuring the memory properly, that a few bits could be used to address the instructions, without a significant cost to speed.
The reduction was so that only the upper log_2(n) bits are required to specify the nano-address, when compared to the micro-address.
what does this mean?
Well let's stay with the M68K example a bit longer:
It had 640 instructions, out of which only 280 where unique.
had the instructions been coded as simple micro memory, it would have taken up:
640x70 bits. or 44800 bits
however, as only the 280 unique instructions where required to fill all 70 bits, we could apply the nano memory technique to the remaining instructions, and get:
8 < log_2(640-280) < 9 = 9
640*9 bit micro control store, and 280x70 bit nano memory store
total of 25360 bits
or a memory savings of 19440 bits.. which could be laid out as main memory for programmers :)
this shows that the equation:
S = Hm x Wm + Hn x Wn
where:
Hm = Number of words High Level
Wm = Length of words in High Level
Hn = Number of Low Level words
Wn = Length of low level words
S = Control Memory Size (with Nano memory technique)
holds in real life.
note that, micro memory is usually designed vertically (Hm is large, Wm is small) and nano programs are usually opposite Hn small, Wn Large.
Back to the question
I had a few problems understanding the wording of the problem, - that may because my first language is Danish, but still I tried to make some sense of it and got to:
proposition 1:
1000 instructions
32 bits
450 uniques
µCode:
1000 * 32 = 32.000 bits
bit width required for nano memory:
log2(1000-450) > 9 => 10
450 * 32 = 14400
(1000-450) * 10 = 5500
32000 - (14400 + 5500) = 12.100 bits saved
Which is not any of your answers.
please provide clarification?
UPDATE:
"the control word is 32 bit. we can code the 450 pattern with 9 bit and we use these 9 bits instead of 32 bit control word. reduce memory from 1000*(32+x) to 1000*(9+x) is equal to 23kbits. – Ali Movagher"
There is your problem, we cannot code the 450 pattern with 9 bits, as far as I can see we need 10..

What makes a CPU architecture "X-bit"?

Warning: I'm not sure where this type of question belongs. If you know a better place for it, drop a link.
Background: Imagine you heard a sentence like this: "this computer/processor has X-bit architecture". Now, if that computer is standard, you get a lot of information, like maximum RAM capacity, maximum unsigned/signed integer value and so on... But what if computer is not standard?
The mystery: back to 70's and 80's, the period referred as "8-bit era". Wait, 8-bit? Yes. So, if a CPU architecture is 8-bit, then:
The maximum RAM capacity of computer is exactly 256 bytes.
The maximum UInt range is from 0 to 256 and the maximum signed integer range is -128 to 127.
The maximum ROM capacity is also 256 bytes, because you have to be able to jump around?
However, it's clearly not like that. Look at some technical characteristics of game consoles of that time and you will see that those exceed the 256 limit.
Quotes (http://www.8bitcomputers.co.uk/whatbasics.html):
The Sharp PC1211 is actually a 4-bit computer but cleverly glues two together to look like 8 (a computer able to add up to 16 would not be very useful!)
So if it's a 4-bit computer, why can manipulate 8-bit integers? And another one...
The Sinclair QL is one of those computers that actually leaves the experts arguing. In parts, it is a 16 bit computer, in some ways it is even like a 32 bit computer but it holds its memory in 8 bits.
What? So why is this mess in www.8bitcomputers.co.uk?
Generally: how is an X-bit computer defined?
The biggest data bus that it has is X bits long (then Sinclair QL is a 32-bit computer)?
The CU functions of that computer are X bits long?
It holds its memory (in registers, ROM, RAM, whatever) in 8 bits?
Other definitions?
Purpose: I think that what I am designing is a 4-bit CPU. I don't really know if it has a 4-bit architecture, because it uses double ROM address, and includes functions like "activate ALU" that take another 4 bits from register Y. I want to know if I can still call it a 4-bit CPU. That's it!
Thank you very much in advance :)
An X-bit computer (or CPU) is defined whether the central unites and registers, such as CPU and ALU, are in X-bit. The addressing doesn't matter in defining the number X. As you have mentioned, an 8-bit computer (e.g. Motorola 68HC11 even tough it is a MCU, still it can be counted as a computer with CPU, I/O and Memory) can have 16-bit addressing in order to increase the RAM or memory size.
The data-bus size and the register sizes of CPU and ALU is the limiting factor in defining the X number in an X-bit computer architecture. You can get more information from http://en.wikipedia.org/wiki/Word_(computer_architecture)
An answer to your question will be "Yes, you are designing a 4-bit CPU if the registers and data bus size are in 4-bit.

why is data structure alignment important for performance?

Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align its members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4 byte aligned in Linux and 8 byte aligned in Windows?
Alignment helps the CPU fetch data from memory in an efficient manner: less cache miss/flush, less bus transactions etc.
Some memory types (e.g. RDRAM, DRAM etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions" i.e. many words at one time) in order to yield efficient results. This is due to many things amongst which:
setup time: time it takes for the memory devices to access the memory locations
bus arbitration overhead i.e. many devices might want access to the memory device
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a "mis-aligned" structure will yield lower overall performance. A good example of such pitfall: suppose a data structure is mis-aligned and requires the CPU/Memory Controller to perform 2 bus transactions (instead of 1) in order to fetch the said structure, the performance is thus consequently lower.
the CPU fetches data from memory in groups of 4 bytes (it actualy depends on the hardware its 8 or other values for some types of hardware, but lets stick with 4 to keep it simple),
all is well if the data begins in an address which is dividable by 4, the CPU goes to the memory address and loads the data.
now suppose the data begins in an address not dividable by 4 say for the sake of simplicity at address 1, the CPU must take data from address 0 and then apply some algorithm to dump the byte at the 0 address , to gain access to the actual data at byte 1. this takes time and therefore lowers preformance. so it is much more efficient to have all data addresses aligned.
A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
myrandomreader:
; ...
; ten instructions to generate next pseudo-random
; address in ESI from previous address
; ...
MOV EAX, DS:[ESI] ; X
LOOP myrandomreader
For memory read straddling two cachelines:
(for L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2->L1 into the processor before it can request the second cache line, causing a short execution stall
(for L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete rather than one
Processor stalls
A random 4 byte read will straddle a cacheline boundary about 5% of the time for 64 byte cachelines, 10% for 32 byte ones and 20% for 16 byte ones.
There may be additional execution overheads for some instructions on misaligned data even if it is within a cacheline. This is talked about on the Intel website for some SSE instructions.
If you are defining the structures yourself, it may make sense to look at listing all the <32bit data fields together in a struct so that padding overhead is reduced or alternatively review whether it is better to turn packing on or off for a particular structure.
On MIPS and many other platforms you don't get the choice and must align - kernel exception if you don't!!
Alignment may also matter extra specially to you if you are doing I/O on the bus or using atomic operations such as atomic increment/decrement or if you wish to be able to port your code to non-Intel.
On Intel only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).
In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word aligned boundaries - so, to load a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!

Resources