Micro-programmed control circuit and one question - memory management

I ran into a question:
In a digital system with a micro-programmed control unit, the total number of distinct operation patterns of the 32 control signals is 450. If the micro-program memory contains 1K micro-instructions, how many bits are saved from the micro-program memory by using a nano memory?
1) 22 Kbits
2) 23 Kbits
3) 450 Kbits
4) 450*32 Kbits
I read in my notes that (1) is correct, but I couldn't understand how to get this.
Edit: Micro-instructions are stored in the micro memory (control memory). There is a chance that a group of micro-instructions may occur several times in a micro-program, and as a result more memory space is needed. By making use of the nano memory we can achieve a significant saving in memory when a group of micro-operations occurs several times in a micro-program. Please see the following reference for the nano technique:

Control Units
Back in the day, before .NET, when you actually had to know what a computer was before you could make it do stuff, this question would have gotten a ton of answers.
Except, back then, the internet wasn't really a thing, and Stack Overflow was not really a problem, as the concepts of a stack and a heap weren't really a standard...
So just to make sure that we are in fact talking about the same thing, I will just try to explain this...
The control unit in a digital computer initiates sequences of microoperations. In a bus-oriented system, the control signals that specify microoperations are
groups of bits that select the paths in multiplexers, decoders, and ALUs.
So we are looking at the control unit, and the instruction set for making it capable of actually doing stuff.
We are dealing with what steps should happen when the compiled assembly requests a bit shift, clears a register, or similar "low level" stuff.
Some of these instructions may be hardwired, but usually not all of them.
Micro-programs
Quote: "Microprogramming is an orderly method of designing the control unit
of a conventional computer"
(http://www2.informatik.hu-berlin.de/rok/ca/data/slides/english/ca9.pdf)
The control variables for the control unit can be represented by a string of 1's and 0's called a "control word". A microprogrammed control unit is a control unit whose binary control variables are not hardwired but are stored in a memory. Before we optimized stuff, we called this memory the micro memory ;)
Typically we would actually be looking at two "memories": a control memory and a main memory.
the control memory is for the microprogram,
and the main memory is for instructions and data
The process of code generation for the control memory is called
microprogramming.
... ok?
Transfer of information among registers in the processor is through MUXes rather than a bus. We typically have a few registers, some of which are familiar to programmers and some are not. The ones that should ring a bell for most in here are the processor registers. The four most common processor registers are:
Program counter – PC
Address register – AR
Data register – DR
Accumulator register - AC
Examples where microcode uses processor registers to do stuff
Assembly instruction "ADD"
pseudo micro code: " AC ← AC + M[EA] " where M[EA] is the data in main memory at the effective address EA
control word: 0000
Assembly instruction "BRANCH"
pseudo micro code "If (AC < 0) then (PC ← EA) "
control word: 0001
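To make this concrete, here is a tiny cartoon of the idea (an entirely hypothetical layout of my own, just to show that a micro-programmed control unit is basically a table of control words read out by a micro-address):

#include <cstdint>
#include <iostream>

// Hypothetical 4-bit control words; each pattern tells the datapath which
// micro-operation to perform in this step.
enum : std::uint8_t { UOP_ADD = 0b0000, UOP_BRANCH = 0b0001 };

// The "control memory": micro-address in, control word out.
const std::uint8_t control_memory[] = { UOP_ADD, UOP_BRANCH };

int main() {
    for (int uaddr = 0; uaddr < 2; ++uaddr) {
        std::uint8_t cw = control_memory[uaddr];   // fetch the control word
        std::cout << "uaddr " << uaddr << " -> control word "
                  << int(cw) << '\n';              // the datapath would decode cw
    }
}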
Micro-memory
The micro memory only concerns how we organize what's in the control memory.
However, when we have big instruction sets, we can do better than simply storing all the instructions. We can subdivide the control memory into a "control memory" and a "nano memory" (since nano is smaller than micro, right? ;) )
This is good as we don't waste a lot of valuable space (chip area) on microcode.
The concept of nano memory is derived from a combination of vertical and horizontal micro-instructions, and also provides a trade-off between them.
The Motorola M68000 is one of the earlier and more popular µComputers with this nano memory control design. Here it was shown that a significant saving of memory could be achieved when a group of micro-instructions occurs often in a microprogram.
It was also shown that, by structuring the memory properly, a few bits could be used to address the instructions without a significant cost in speed.
The reduction comes from the fact that only log2(n) bits are required to specify a nano-address, compared to the full width of a micro-instruction.
what does this mean?
Well let's stay with the M68K example a bit longer:
It had 640 micro-instructions, out of which only 280 were unique.
Had the instructions been coded as a simple micro memory, it would have taken up:
640 x 70 bits, or 44,800 bits.
However, as only the 280 unique instructions need to be stored at the full 70 bits, we can apply the nano memory technique to the rest and get:
8 < log2(640-280) < 9, so 9 bits
a 640 x 9-bit micro control store plus a 280 x 70-bit nano memory store,
a total of 25,360 bits,
or a memory saving of 19,440 bits... which could be laid out as main memory for programmers :)
This shows that the equation:
S = Hm x Wm + Hn x Wn
where:
Hm = number of high-level (micro) words
Wm = width of a high-level word
Hn = number of low-level (nano) words
Wn = width of a low-level word
S = control memory size (with the nano memory technique)
holds in real life.
Note that micro memory is usually designed vertically (Hm is large, Wm is small) and nano programs are usually the opposite: Hn small, Wn large.
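As a quick sanity check of the formula, here is a throwaway sketch of my own using the M68000 numbers from above (I compute the pointer width as ceil(log2(unique words)), which also comes out to 9 bits here):

#include <cmath>
#include <iostream>

int main() {
    // M68000-style example from the text above.
    const int total_uinstr  = 640;   // Hm: words in the micro (high-level) store
    const int unique_uinstr = 280;   // Hn: words in the nano (low-level) store
    const int word_width    = 70;    // Wn: width of one full control word

    // Width of a micro word = bits needed to address the nano store.
    const int Wm = static_cast<int>(std::ceil(std::log2(unique_uinstr)));   // 9

    const int plain = total_uinstr * word_width;                      // 44800
    const int nano  = total_uinstr * Wm + unique_uinstr * word_width; // 25360

    std::cout << "plain micro store: " << plain << " bits\n"
              << "with nano memory:  " << nano  << " bits\n"
              << "saving:            " << (plain - nano) << " bits\n"; // 19440
}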
Back to the question
I had a few problems understanding the wording of the problem (that may be because my first language is Danish), but I tried to make some sense of it and got to:
proposition 1:
1000 instructions
32 bits
450 uniques
µCode:
1000 * 32 = 32,000 bits
bit width required for nano memory:
log2(1000-450) > 9 => 10
450 * 32 = 14400
(1000-450) * 10 = 5500
32,000 - (14,400 + 5,500) = 12,100 bits saved
Which is not any of your answers.
Please provide clarification.
UPDATE:
"the control word is 32 bit. we can code the 450 pattern with 9 bit and we use these 9 bits instead of 32 bit control word. reduce memory from 1000*(32+x) to 1000*(9+x) is equal to 23kbits. – Ali Movagher"
There is your problem: we cannot code the 450 patterns with 9 bits; as far as I can see we need 10.

Related

How can x86 bsr/bsf have fixed latency, not data dependent? Doesn't it loop over bits like the pseudocode shows?

I am on the hook to analyze some "timing channels" of some x86 binary code. I am posting one question to comprehend the bsf/bsr opcodes.
At a high level, these two opcodes can be modeled as a "loop" that counts the leading or trailing zeros of a given operand. The x86 manual has a good formalization of these opcodes, something like the following:
IF SRC = 0
  THEN
    ZF ← 1;
    DEST is undefined;
  ELSE
    ZF ← 0;
    temp ← OperandSize - 1;
    WHILE Bit(SRC, temp) = 0
    DO
      temp ← temp - 1;
    OD;
    DEST ← temp;
FI;
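For reference, the same behaviour written out as plain C++ (just a behavioural model of the BSR case above, not a claim about how any CPU implements it):

#include <cstdint>
#include <iostream>

// Behavioural model of BSR (bit scan reverse) for a 64-bit operand:
// returns the index of the highest set bit, or -1 when src == 0
// (the real instruction would set ZF and leave DEST undefined instead).
int bsr64(std::uint64_t src) {
    if (src == 0)
        return -1;
    int temp = 63;                        // OperandSize - 1
    while (((src >> temp) & 1) == 0)
        --temp;                           // scan downwards until a set bit is found
    return temp;                          // DEST <- temp
}

int main() {
    std::cout << bsr64(0x00F0) << '\n';   // prints 7
}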
But to my surprise, bsf/bsr instructions seem to have fixed CPU cycles. According to some documents I found here: https://gmplib.org/~tege/x86-timing.pdf, it seems that they always take 8 CPU cycles to finish.
So here are my questions:
I want to confirm that these instructions have fixed CPU cycles. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind them. I cannot find corresponding specifications in Intel's official documents.
Then why is this possible? Apparently this is a "loop" of sorts, at least at a high level. What is the design decision behind it? Easier for CPU pipelines?
BSF/BSR performance is not data dependent on any modern CPUs. See https://agner.org/optimize/, https://uops.info/, or http://instlatx64.atw.hu/ for experimental timing results, as well as the https://gmplib.org/~tege/x86-timing.pdf you found.
On modern Intel, they decode to 1 uop with 3 cycle latency and 1/clock throughput, running only on port 1. Ryzen also runs them with 3c latency for BSF, 4c latency for BSR, but multiple uops. Earlier AMD is sometimes even slower.
(Prefer rep bsf aka tzcnt in code that might run on AMD CPUs, if you don't need the FLAGS difference between bsf and tzcnt for zero inputs. lzcnt and tzcnt are fast on AMD as well, like 1 cycle latency with 3/clock throughput for lzcnt on Zen 2 (https://uops.info/). Unfortunately lzcnt and bsr aren't compatible that way, so you can't use it in an "optimistic" forward-compatible way, you have to know which you're getting.)
Your "8 cycle" (latency and throughput) cost appears to be for 32-bit BSF on AMD K8, from Granlund's table that you linked. Agner Fog's table agrees, (and shows it decodes to 21 uops instead of having a dedicated bit-scan execution unit. But the microcoded implementation is presumably still branchless and not data-dependent). No clue why you picked that number; K8 doesn't have SMT / Hyperthreading so the opportunity for an ALU-timing side channel is much reduced.
Do note that they have an output dependency on the destination register, which they leave unmodified if the input was zero. AMD documents this behaviour, Intel implements it in hardware but documents it as an "undefined" result, so unfortunately compilers won't take advantage of it and human programmers maybe should be cautious. IDK if some ancient 32-bit only CPU had different behaviour, or if Intel is planning to ever change (doubtful!), but I wish Intel would document the behaviour at least for 64-bit mode (which excludes any older CPUs).
lzcnt/tzcnt and popcnt on Intel CPUs (but not AMD) have the same output dependency before Skylake and before Cannon Lake (respectively), even though architecturally the result is well-defined for all inputs. They all use the same execution unit. (How is POPCNT implemented in hardware?). AMD Bulldozer/Ryzen builds their bit-scan execution unit without the output dependency baked in, so BSF/BSR are slower than LZCNT/TZCNT (multiple uops to handle the input=0 case, and probably also setting ZF according to the input, not the result).
(Taking advantage of that with intrinsics isn't possible; not even with MSVC's _BitScanReverse64 which uses a by-reference output arg that you could set first. MSVC doesn't respect the previous value and assumes it's output-only. VS: unexpected optimization behavior with _BitScanReverse64 intrinsic)
The pseudocode in the manual is not the implementation
(i.e. it's not necessarily how hardware or microcode works).
It gives precisely the same result in all cases, so you can use it to understand exactly what will happen for any corner cases the text leaves you wondering about. That is all.
The point is to be simple and easy to understand, and that means modeling things in terms of simple 2-input operations which happen serially. C / Fortran / typical pseudocode doesn't have operators for many-input AND, OR, or XOR, but you can build that in hardware up to a point (limited by fan-in, the opposite of fan-out).
Integer addition can be modelled as bit-serial ripple carry, but that's not how it's implemented! Instead, we get single-cycle latency for 64-bit addition with far fewer than 64 gate delays using tricks like carry lookahead adders.
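As a toy illustration of the idea (nothing like a real adder circuit, just showing that every carry can be a flat AND/OR expression of the inputs instead of a chain through all the lower bits):

#include <cstdint>
#include <iostream>

// Toy 4-bit carry-lookahead adder: every carry is computed directly from
// the generate (g) and propagate (p) signals, so no carry has to ripple
// through the lower bit positions one at a time.
std::uint8_t add4_cla(std::uint8_t a, std::uint8_t b) {
    bool g[4], p[4], c[5];
    for (int i = 0; i < 4; ++i) {
        g[i] = (a >> i & 1) & (b >> i & 1);   // this bit generates a carry
        p[i] = (a >> i & 1) ^ (b >> i & 1);   // this bit propagates a carry
    }
    c[0] = false;                             // carry-in
    c[1] = g[0];
    c[2] = g[1] | (p[1] & g[0]);
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]);
    c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0]);

    std::uint8_t sum = c[4] << 4;             // carry-out becomes bit 4
    for (int i = 0; i < 4; ++i)
        sum |= (p[i] ^ c[i]) << i;            // sum bit = p XOR carry into that bit
    return sum;
}

int main() {
    std::cout << int(add4_cla(9, 7)) << '\n'; // prints 16
}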
The actual implementation techniques used in Intel's bit-scan / popcnt execution unit are described in US Patent US8214414 B2.
Abstract
A merged datapath for PopCount and BitScan is described. A hardware circuit includes a compressor tree utilized for a PopCount function, which is reused by a BitScan function (e.g., bit scan forward (BSF) or bit scan reverse (BSR)).
Selector logic enables the compressor tree to operate on an input word for the PopCount or BitScan operation, based on a microprocessor instruction. The input word is encoded if a BitScan operation is selected.
The compressor tree receives the input word, operates on the bits as though all bits have same level of significance (e.g., for an N-bit input word, the input word is treated as N one-bit inputs). The result of the compressor tree circuit is a binary value representing a number related to the operation performed (the number of set bits for PopCount, or the bit position of the first set bit encountered by scanning the input word).
It's fairly safe to assume that Intel's actual silicon works similarly to this. Other Intel patents for things like out-of-order machinery (ROB, RS) do tend to match up with performance experiments we can perform.
AMD may do something different, but regardless we know from performance experiments that it's not data-dependent.
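To get a feel for why no data-dependent loop is needed, here is a branchless software sketch (my own illustration, unrelated to the patent's compressor tree): it narrows down the highest set bit in a fixed number of steps, regardless of the input value.

#include <cstdint>
#include <iostream>

// Branchless bit-scan-reverse: always performs exactly 6 halving steps
// (32, 16, 8, 4, 2, 1), so the amount of work is the same for every input.
int bsr64_fixed(std::uint64_t x) {
    int pos = 0;
    for (int shift = 32; shift > 0; shift >>= 1) {   // fixed trip count
        bool upper_nonzero = (x >> shift) != 0;
        pos += shift * upper_nonzero;                // add 0 or shift, unconditionally
        x >>= shift * upper_nonzero;                 // discard the half we ruled out
    }
    return pos;   // for x == 0 this returns 0 (the real BSR leaves DEST undefined)
}

int main() {
    std::cout << bsr64_fixed(0x00F0) << '\n';        // prints 7
}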
It's well known that fixed latency is a hugely beneficial thing for out-of-order scheduling, so it's very surprising when instructions don't have fixed latency. Sandybridge even went so far as to standardize uop latencies to simplify the scheduler and reduce the opportunities for write-back conflicts (e.g. a 3-cycle latency uop followed by a 2-cycle latency uop to the same port would produce 2 results in the same cycle). This meant making complex-LEA (with all 3 components: [disp + base + idx*scale]) take 3 cycles instead of just 2 for the 2 additions like on previous CPUs. There are no 2-cycle latency uops on Sandybridge-family. (There are some 2-cycle latency instructions, because they decode to 2 uops with 1c latency each. The scheduler schedules uops, not instructions.)
One of the few exceptions to the rule of fixed latency for ALU uops is division / sqrt, which uses a not-fully-pipelined execution unit. Division is inherently iterative, unlike multiplication where you can make wide hardware that does the partial products and partial additions in parallel.
On Intel CPUs, variable-latency for L1d cache access can produce replays of dependent uops if the data wasn't ready when the scheduler optimistically hoped it would be.
Is there a penalty when base+offset is in a different page than the base?
Why does the number of uops per iteration increase with the stride of streaming loads?
Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
The 80x86 manual has a good description of the expected behavior, but that has nothing to do with how it's actually implemented in silicon in any model from any manufacturer.
Let's say that there's been 50 different CPU designs from Intel, 25 CPU designs from AMD, then 25 more from other manufacturers (VIA, Cyrix, SiS/Vortex, NSC, ...). Out of those 100 different CPU designs, maybe there's 20 completely different ways that BSF has been implemented, and maybe 10 of them have fixed timing, 5 have timing that depends on every bit of the source operand, and 5 depend on groups of bits of the source operand (e.g. maybe like "if highest 32 bits of 64-bit operand are zeros { switch to 32-bit logic that's 2 cycles faster }").
I want to confirm that these instructions have fixed CPU cycles. In other words, no matter what operand is given, they always take the same amount of time to process, and there is no "timing channel" behind them. I cannot find corresponding specifications in Intel's official documents.
You can't. More specifically, you can test or research existing CPUs, but that's a waste of time because next week Intel (or AMD or VIA or someone else) can release a new CPU that has completely different timing.
As soon as you rely on "measured from existing CPUs" you're doing it wrong. You have to rely on "architectural guarantees" that apply to all future CPUs. There is no "architectural guarantee". You have to assume that there may be a timing side-channel (even if there isn't for current CPUs).
Then why is this possible? Apparently this is a "loop" of sorts, at least at a high level. What is the design decision behind it? Easier for CPU pipelines?
Instead of doing a 64-bit BSF, why not split it into a pair of 32-bit pieces and do them in parallel, then merge the results? Why not split it into eight 8-bit pieces? Why not use a table lookup for each 8-bit piece?
The answers posted have explained well that the implementation is different from the pseudocode. But if you are still curious why the latency is fixed and not data dependent, or whether it uses any loops for that matter, you need to see the electronic side of things.
One way you could implement this feature in hardware is by using a Priority encoder.
A priority encoder will accept n input lines that can be on or off (0 or 1) and give out the index of the highest-priority line that is on. Below is a table from the linked Wikipedia article, modified for a most-significant-set-bit function.
input | output | index of most significant set bit
0000  | xx     | undefined
0001  | 00     | 0
001x  | 01     | 1
01xx  | 10     | 2
1xxx  | 11     | 3
(x denotes that the bit value does not matter and can be anything)
If you see the circuit diagram on the article, there are no loops of any kind, it is all parallel.
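A software analogue of that table (a rough sketch of my own, not the actual circuit) can be written as pure combinational logic, with no loop whose length depends on the data:

#include <iostream>

// 4-input priority encoder as pure combinational logic: each output is a
// boolean expression of the inputs, evaluated "all at once" like gates.
struct EncoderOut {
    bool valid;   // at least one input bit was set
    bool o1, o0;  // 2-bit index of the highest set input
};

EncoderOut priority_encode4(bool i3, bool i2, bool i1, bool i0) {
    EncoderOut out;
    out.valid = i3 | i2 | i1 | i0;
    out.o1    = i3 | i2;            // index is 2 or 3
    out.o0    = i3 | (!i2 & i1);    // index is 1 or 3
    return out;
}

int main() {
    EncoderOut r = priority_encode4(false, true, true, false);  // input 0110
    std::cout << r.o1 << r.o0 << '\n';                          // prints 10 (index 2)
}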

Why do bytes exist? Why don't we just use bits?

A byte consists of 8 bits on most systems.
A byte typically represents the smallest data type a programmer may use. Depending on language, the data types might be called char or byte.
There are some types of data (booleans, small integers, etc) that could be stored in fewer bits than a byte. Yet using less than a byte is not supported by any programming language I know of (natively).
Why does this minimum of using 8 bits to store data exist? Why do we even need bytes? Why don't computers just use increments of bits (1 or more bits) rather than increments of bytes (multiples of 8 bits)?
Just in case anyone asks: I'm not worried about it. I do not have any specific needs. I'm just curious.
Because at the hardware level memory is naturally organized into addressable chunks. Small chunks mean that you can have fine-grained things like 4-bit numbers; large chunks allow for more efficient operation (typically a CPU moves things around in 'chunks', or multiples thereof). In particular, larger addressable chunks make for bigger address spaces: if my chunks are 1 bit, then an address range of 1-500 only covers 500 bits, whereas 500 8-bit chunks cover 4000 bits.
Note - it was not always 8 bits. I worked on a machine that thought in 6 bits. (good old octal)
Paper tape (~1950's) was 5 or 6 holes (bits) wide, maybe other widths.
Punched cards (the newer kind) were 12 rows of 80 columns.
1960s:
B-5000 - 48-bit "words" with 6-bit characters
CDC-6600 -- 60-bit words with 6-bit characters
IBM 7090 -- 36-bit words with 6-bit characters
There were 12-bit machines; etc.
1970-1980s, "micros" enter the picture:
Intel 4004 - 4-bit chunks
8008, 8086, Z80, 6502, etc - 8 bit chunks
68000 - 16-bit words, but still 8-bit bytes
486 - 32-bit words, but still 8-bit bytes
today - 64-bit words, but still 8-bit bytes
future - 128, etc, but still 8-bit bytes
Get the picture? Americans figured that characters could be stored in only 6 bits.
Then we discovered that there was more in the world than just English.
So we floundered around with 7-bit ASCII and 8-bit EBCDIC.
Eventually, we decided that 8 bits was good enough for all the characters we would ever need. ("We" were not Chinese.)
The IBM 360 came out as the dominant machine in the '60s-'70s; it was based on an 8-bit byte. (It sort of had 32-bit words, but that became less important than the all-mighty byte.)
It seemed such a waste to use 8 bits when you really only needed 7 bits to store all the characters you would ever need.
IBM, in the mid-20th century "owned" the computer market with 70% of the hardware and software sales. With the 360 being their main machine, 8-bit bytes was the thing for all the competitors to copy.
Eventually, we realized that other languages existed and came up with Unicode/utf8 and its variants. But that's another story.
Good way for me to write something late at night!
Your points are perfectly valid; however, history will always be that insane intruder who would have ruined your plans long before you were born.
For the purposes of explanation, let's imagine a fictitious machine with an architecture by the name of Bitel(TM) Inside or something of the like. The Bitel specifications mandate that the Central Processing Unit (CPU, i.e., microprocessor) shall access memory in one-bit units. Now, let's say a given instance of a Bitel-operated machine has a memory unit holding 32 billion bits (our fictitious equivalent of a 4GB RAM unit).
Now, let's see why Bitel, Inc. got into bankruptcy:
The binary code of any given program would be gigantic (the compiler would have to manipulate every single bit!)
32-bit addresses would be (even more) limited to hold just 512MB of memory. 64-bit systems would be safe (for now...)
Memory accesses would literally be a bottleneck. By the time the CPU has got all of the 48 bits it needs to process a single ADD instruction, the floppy would have already spun for too long, and you know what happens next...
Who the **** really needs to optimize a single bit? (See previous bankruptcy justification).
If you need to handle single bits, learn to use bitwise operators! (See the sketch after this list.)
Programmers would go crazy as both coffee and RAM get too expensive. At the moment, this is a perfect synonym of apocalypse.
The C standard is holy and sacred, and it mandates that the minimum addressable unit (i.e, char) shall be at least 8 bits wide.
8 is a perfect power of 2. (1 is another one, but meh...)
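For what it's worth, here is a small example of what "use bitwise operators" means in practice (a hedged sketch: packing several booleans into one byte yourself, instead of wishing for bit-addressable memory):

#include <cstdint>
#include <iostream>

int main() {
    std::uint8_t flags = 0;           // eight independent booleans in one byte

    flags |=  (1u << 3);              // set bit 3
    flags &= ~(1u << 3);              // clear bit 3
    flags ^=  (1u << 5);              // toggle bit 5
    bool bit5 = (flags >> 5) & 1u;    // test bit 5

    std::cout << bit5 << " " << int(flags) << '\n';   // prints "1 32"
}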
In my opinion, it's an issue of addressing. To access individual bits of data, you would need eight times as many addresses (adding 3 bits to each address) compared to accessing individual bytes. The byte is generally going to be the smallest practical unit to hold a number in a program (with only 256 possible values).
Some CPUs use words to address memory instead of bytes. That's their natural data type, so 16 or 32 bits. If Intel CPUs did that it would be 64 bits.
8 bit bytes are traditional because the first popular home computers used 8 bits. 256 values are enough to do a lot of useful things, while 16 (4 bits) are not quite enough.
And, once a thing goes on for long enough it becomes terribly hard to change. This is also why your hard drive or SSD likely still pretends to use 512 byte blocks. Even though the disk hardware does not use a 512 byte block and the OS doesn't either. (Advanced Format drives have a software switch to disable 512 byte emulation but generally only servers with RAID controllers turn it off.)
Also, Intel/AMD CPUs have so much extra silicon doing so much extra decoding work that the slight difference in 8 bit vs 64 bit addressing does not add any noticeable overhead. The CPU's memory controller is certainly not using 8 bits. It pulls data into cache in long streams and the minimum size is the cache line, often 64 bytes aka 512 bits. Often RAM hardware is slow to start but fast to stream so the CPU reads kilobytes into L3 cache, much like how hard drives read an entire track into their caches because the drive head is already there so why not?
First of all, C and C++ do have native support for bit-fields.
#include <iostream>

struct S {
    // will usually occupy 2 bytes:
    // 3 bits: value of b1
    // 2 bits: unused
    // 6 bits: value of b2
    // 2 bits: value of b3
    // 3 bits: unused
    unsigned char b1 : 3, : 2, b2 : 6, b3 : 2;
};

int main()
{
    std::cout << sizeof(S) << '\n'; // usually prints 2
}
Probably the answer lies in performance and memory alignment, and in the fact that (I reckon partly because the byte is called char in C) a byte is the smallest part of a machine word that can hold 7-bit ASCII. Text operations are common, so a special type for plain text has its value in a programming language.
Why bytes?
What is so special about 8 bits that it deserves its own name?
Computers do process all data as bits, but they prefer to process bits in byte-sized groupings. Or to put it another way: a byte is how much a computer likes to "bite" at once.
The byte is also the smallest addressable unit of memory in most modern computers. A computer with byte-addressable memory cannot store an individual piece of data that is smaller than a byte.
What's in a byte?
A byte represents different types of information depending on the context. It might represent a number, a letter, or a program instruction. It might even represent part of an audio recording or a pixel in an image.
Source

AXI4 (Lite) Narrow Burst vs. Unaligned Burst Clarification/Compatibility

I'm currently writing an AXI4 master that is supposed to support AXI4 Lite (AXI4L) as well.
My AXI4 master is receiving data from a 16-bit interface. This is on a Xilinx Spartan 6 FPGA and I plan on using the EDK AXI4 Interconnect IP, which has a minimum WDATA width of 32 bits.
At first I wanted to use narrow burst, i.e. AWSIZE = x"01" (2 bytes in transfer). However, I found that Xilinx' AXI Reference Guide UG761 states "narrow bursts [are] supported but [...] not recommended." Unaligned transactions are supposed to be supported.
This had me thinking. Say I start an unaligned burst:
AWLEN = x"01" (2 beats)
AWSIZE = x"02" (4 bytes in transfer")
And do the following:
AX (32-bit word #0: send hi16)
XB (32-bit word #1: send lo16)
Where A, B are my 16 bit words that start off at an unaligned (2-byte aligned) address. X means WSTRB is deasserted for the indicated 16 bit.
Is this supported or does this fall under the category "narrow burst" even through AWSIZE = x"02" (4 bytes in transfer) as opposed to AWSIZE = x"01" (2 bytes in transfer)?
Now, if this was just for AXI4, I would probably not care as much about this use case, because AXI4 peripherals are required to use the WSTRB signals. However, the AXI Reference Guide UG761 states "[AXI4L] Slaves interface can elect to ignore WSTRB (assume all bytes valid)."
I read here that many (but not all; and there is no list?) Xilinx AXI4L peripherals do elect to ignore WSTRB.
Does this mean that I'm essentially barred from doing narrow burst ("not recommended") as well as unaligned bursts ("WSTRB can be ignored") or is there an easy way to unload some of the implementation work from my master into the interconnect, guaranteeing proper system behavior when accessing AXI4L peripherals?
Your example is not a narrow burst, and should work.
The reason narrow burst is not recommended is that it gives sub-optimal performance. Both narrow bursts and data realignment cost area and are not recommended IMHO. However, DRE (data realignment) has minimal bandwidth cost, while narrow bursts do cost bandwidth. If your AXI port is 100 MHz and 32 bits, you have 3.2 Gbit/s maximum throughput; if you use narrow bursts of 16 bits 50% of the time, then your maximum throughput is reduced to 2.4 Gbit/s (32 bits x 50 MHz + 16 bits x 50 MHz). Also, I'm not sure AXI-Lite supports narrow bursts or data realignment.
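Just to spell out that bandwidth arithmetic (a back-of-the-envelope sketch using the numbers above):

#include <iostream>

int main() {
    const double clk_hz   = 100e6;   // 100 MHz AXI clock
    const double bus_bits = 32;      // full-width beat
    const double narrow   = 16;      // narrow beat

    double full_bw  = clk_hz * bus_bits;                         // 3.2 Gbit/s
    double mixed_bw = 0.5 * clk_hz * bus_bits
                    + 0.5 * clk_hz * narrow;                     // 2.4 Gbit/s

    std::cout << full_bw  / 1e9 << " Gbit/s vs "
              << mixed_bw / 1e9 << " Gbit/s\n";
}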
Your example has 2 major flaws. First, it requires 3 data-beats to transfer 32 bits, which is worse than a narrow burst (I don't think AXI is smart enough to cancel the last beat when WSTRB is all 0). Second, you can't burst more than two 16-bit words at a time, which will hamper your AXI infrastructure's performance if you have a lot of data to transfer.
The best way to deal with this is to concatenate the 16-bit words together to form 32 bits in your block. Then you buffer these 32-bit words and burst them when you have enough. This is the AXI high-performance way to do it.
However, if you receive data as 16 bits, it seems you would be better off using AXI-Stream, which supports 16 bits but doesn't have the notion of addresses. You can map an AXI-Stream to AXI4 using Xilinx's IP cores. Either AXI Datamover or AXI DMA can do that. Both do the same thing (in fact, AXI DMA includes a Datamover), but AXI DMA is controlled through an AXI-Lite interface while the Datamover is controlled through additional AXI-Streams.
As a final note, the Xilinx cores never require narrow bursts or DRE. If you need DRE in AXI DMA, it's done by the AXI DMA core and not the AXI Interconnect. Also, these cores are clear-source, so you can easily check out how they operate.

What makes a CPU architecture "X-bit"?

Warning: I'm not sure where this type of question belongs. If you know a better place for it, drop a link.
Background: Imagine you heard a sentence like this: "this computer/processor has X-bit architecture". Now, if that computer is standard, you get a lot of information, like maximum RAM capacity, maximum unsigned/signed integer value and so on... But what if computer is not standard?
The mystery: back in the '70s and '80s, the period referred to as the "8-bit era". Wait, 8-bit? Yes. So, if a CPU architecture is 8-bit, then:
The maximum RAM capacity of the computer is exactly 256 bytes.
The maximum unsigned integer range is 0 to 255 and the maximum signed integer range is -128 to 127.
The maximum ROM capacity is also 256 bytes, because you have to be able to jump around?
However, it's clearly not like that. Look at some technical characteristics of game consoles of that time and you will see that those exceed the 256 limit.
Quotes (http://www.8bitcomputers.co.uk/whatbasics.html):
The Sharp PC1211 is actually a 4-bit computer but cleverly glues two together to look like 8 (a computer able to add up to 16 would not be very useful!)
So if it's a 4-bit computer, why can it manipulate 8-bit integers? And another one...
The Sinclair QL is one of those computers that actually leaves the experts arguing. In parts, it is a 16 bit computer, in some ways it is even like a 32 bit computer but it holds its memory in 8 bits.
What? So why is this mess in www.8bitcomputers.co.uk?
Generally: how is an X-bit computer defined?
The biggest data bus that it has is X bits long (then Sinclair QL is a 32-bit computer)?
The CU functions of that computer are X bits long?
It holds its memory (in registers, ROM, RAM, whatever) in 8 bits?
Other definitions?
Purpose: I think that what I am designing is a 4-bit CPU. I don't really know if it has a 4-bit architecture, because it uses double ROM address, and includes functions like "activate ALU" that take another 4 bits from register Y. I want to know if I can still call it a 4-bit CPU. That's it!
Thank you very much in advance :)
An X-bit computer (or CPU) is defined by whether the central units and registers, such as the CPU and ALU, are X bits wide. The addressing doesn't matter in defining the number X. As you have mentioned, an 8-bit computer (e.g. the Motorola 68HC11; even though it is an MCU, it can still be counted as a computer with a CPU, I/O and memory) can have 16-bit addressing in order to increase the RAM or memory size.
The data-bus size and the register sizes of the CPU and ALU are the limiting factors in defining the X number in an X-bit computer architecture. You can get more information from http://en.wikipedia.org/wiki/Word_(computer_architecture)
An answer to your question would be: "Yes, you are designing a 4-bit CPU if the registers and the data bus are 4 bits wide."

For ARM, why is a single STM instruction generally faster than multiple STR instructions?

Is it related to some prefetch technology?
Or with DDR access timing characteristics?
IIRC, starting with ARMv5TE the path to the write buffer and the L1 caches is 64 bits wide, to accommodate the LDRD/STRD instructions. This allows STM to write two registers per cycle.
You'll also save a bit of L1 instruction-cache and use only one pipeline on dual-issue cores, which is also an additional gain.
More instructions mean more fetch cycles and more instructions to execute, which takes longer. The buses are (or can be) 64 bits wide; for a single-register STM there is no gain, but with more than one register there can be a reduction in the number of bus cycles needed to move the data, and, depending on the memory system, a full-width write avoids a read-modify-write, which is slow as well. If it has to read-modify-write into the cache, where it normally would have been a write-through, you lose cache space as well as paying the cost of the read. Even if it is a hit in the cache, the read-modify-write might cost you.
You can go to ARM's site and download the AMBA/AXI spec and see how the bus transactions work. There are a number of clock cycles involved for every transaction (multiple transactions can be in flight at once, yes); once you get past the overhead it is one clock per 64 bits worth of data, so 128 bits take one more clock than 64 bits to transfer. 32 bits and 64 bits take the same number of clocks to transfer (if aligned).
I can't speak for all architectures, but I believe on at least one I saw that only reads would actually do more than 64 bits per transfer; writes were broken into separate 64-bit transfers. I could be remembering that wrong.
If you move 4 words' worth of data, read or write, unaligned, I believe that becomes three separate transfers: one for each of the odd words at the ends and one for the aligned 64 bits in the middle. So alignment can matter.
When is that true?
According to this handy table, the STM instruction takes 2 cycles to store a single register, or n cycles to store n registers for n > 1.
On the other hand, STR always takes a single cycle.
When do you get that STM is faster than STR?
For one register, STM is slower.
For n registers (n > 1), they're the same.
On the other hand, the above reference is for the ARM9TDMI architecture, and there are many ARMs.
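Using those ARM9TDMI numbers, here is a quick back-of-the-envelope comparison (purely illustrative; real cores differ):

#include <iostream>

int main() {
    // Cycle counts quoted above for ARM9TDMI (illustrative only).
    for (int n = 1; n <= 4; ++n) {
        int str_cycles = n * 1;              // n separate STRs, 1 cycle each
        int stm_cycles = (n == 1) ? 2 : n;   // STM: 2 cycles for 1 reg, else n
        std::cout << n << " register(s): STRs = " << str_cycles
                  << " cycles, STM = " << stm_cycles << " cycles\n";
    }
}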

Resources