Can Registers inside a CPU do Arithmetic?

I read in many detailed articles that data from the registers is used as operands for the ALU, for instance to add two 32-bit integers, and that this is only one small part of what the ALU can actually do.
However, I also read that a register can do arithmetic too? The difference between the two is quite blurred to me. What is the clear-cut difference between a register and the actual ALU component?
I know the ALU doesn't store values; rather, it receives them and is instructed to simply do the logic part. But a register can both store values and do general-purpose stuff?
If the latter is true, then when does one use the ALU, and when does one use the general-purpose registers?

Registers can't do arithmetic. A "register" is just a term for a place where you can stick a value. You can do arithmetic on the values stored in registers and have the results saved back into the register. This arithmetic would be done by the "ALU," which is the generic term for the portion of the processor that does number-crunching.
If you're still confused by something specific that you read, please quote it here or post a citation and someone can try to clarify. Note that "register" and "ALU" are very generic terms and are implemented and used differently in every architecture.

Although it is true that registers, properly speaking, don't do arithmetic, it IS possible to construct a circuit which both stores a number and, when a particular input line is set, increments that number. One could interpret this as a register which does a (very limited amount of) arithmetic. As it happens, there is generally no use for such a circuit in a general-purpose CPU (except maybe as the Program Counter), but such circuits could be useful in very simple digital controllers.
[note: historically, the first such circuits were made from vacuum tubes and used in Geiger Counters to count how many radioactive decay events occurred over short periods of time]
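For concreteness, here is a minimal VHDL sketch of such a storage element with a built-in increment; the entity and signal names are invented for illustration:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity counting_register is
    port (
        clk  : in  std_logic;
        load : in  std_logic;                 -- store a new value
        inc  : in  std_logic;                 -- bump the stored value by one
        d    : in  unsigned(7 downto 0);
        q    : out unsigned(7 downto 0)
    );
end entity;

architecture rtl of counting_register is
    signal value : unsigned(7 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            if load = '1' then
                value <= d;              -- ordinary register behaviour
            elsif inc = '1' then
                value <= value + 1;      -- the "arithmetic" baked into the register
            end if;
        end if;
    end process;
    q <= value;
end architecture;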

Registers don't do arithmetic. Modern cores have several “execution units” or “functional units” rather than an “ALU”. A reasonable rule of thumb is to discard any text (online or hardcopy) that speaks in terms of ALUs in the context of mainstream CPUs. “ALU” is still meaningful in the context of embedded systems, where a µC might actually have an ALU, but otherwise it's essentially an anachronism. If you see it, usually all it tells you is that the material is seriously out-of-date.

Related

Does it cost significant resources for a modern CPU to keep flags updated?

As I understand it, on a modern out-of-order CPU one of the most expensive things is state, because that state has to be tracked in multiple versions, kept up to date across many instructions, etc.
Some instruction sets like x86 and ARM make extensive use of flags, which were introduced when the cost model was not what it is today and the flags only cost a few logic gates: things like every arithmetic instruction setting flags to detect zero, carry and overflow.
Are these particularly expensive to keep updated on a modern out of order implementation? Such that e.g. an ADD instruction updates the carry flag, and this must be tracked because although it will probably never be used, it is possible that some other instruction could use it N instructions later, with no fixed upper bound on N?
Are integer operations like addition and subtraction cheaper on instruction set architectures like MIPS that do not have these flags?
Various aspects of this are not very publicly known, so I will try to separate definitely known things from reasonable guesses and conjecture.
An approach has been to extend the (physical) integer registers (whether they take the form of a physical register file [eg P4 and SandyBridge+] or of results-in-ROB [eg P3]) with the flags that were produced by the operation that also produced the associated integer result. That's only about the arithmetic flags (sometimes AFLAGS, not to be confused with EFLAGS), but I don't think the "weird flags" are the focus of this question. Interestingly there is a patent[1] that hints at storing more than just the 6 AFLAGS themselves, putting some "combination flags" in there as well, but who knows whether that was really done - most sources say the registers are extended by 6 bits, but AFAIK we (the public) don't really know.
Lumping the integer result and associated flags together is described in for example this patent[2], which is primarily about preventing a certain situation where the flags might accidentally no longer be backed by any physical register. Aside from such quirks, during normal operation it has the nice effect of only needing to allocate 1 register for an arithmetic operation, rather than a separate main-result and flags-result, so renaming is normally not made much worse by the existence of the flags.
Additionally, either the register alias table needs at least one more slot to keep track of which integer register contains the latest flags, or a separate flag-renaming-state buffer keeps track of the latest speculative flag state ([2] suggests Intel chose to separate them, which may simplify the main RAT but they don't go into such details). More slots may be used[3] to efficiently implement instructions which only update a subset of the flags (NetBurst™ famously lacked this, resulting in the now-stale advice to favour add over inc). Similarly, the non-speculative architectural state (whether it would be part of the retirement register file or be separate-but-similar again is not clear) needs at least one such slot.
A separate issue is computing the flags in the first place. [1] suggests separating flag generation from the main ALU simplifies the design. It's not clear to what degree they would be separated: the main ALU has to compute the Adjust and Sign flags anyway, and having an adder output a carry out the top is not much to ask (less than recomputing it from nothing). The overflow flag only takes an extra XOR gate, to combine the carry into the top bit with the carry out of the top bit. The Zero flag and Parity flag are not free though (and they depend on the result, not on the calculation of the result); if there is partial separation, it would make sense that those would be computed separately. Perhaps it really is all separate.
In NetBurst™, flag calculation took an extra half-cycle (the ALU was double-pumped and staggered)[4], but whether that means all flags are computed separately or only a subset of them (or even a superset, as [1] hinted) is not clear - the flags result is treated as monolithic, so latency tests cannot distinguish whether a flag is computed in the third half-cycle by the flags unit or just handed to the flags unit by the ALU. In any case, typical ALU operations could be executed back-to-back even if dependent (meaning that the high half of the first operation and the low half of the second operation ran in parallel); the delayed computation of the flags did not stand in the way of that. As you might expect, ADC and SBB were not so efficient on NetBurst, but there may be other reasons for that too (for some reason a lot of µops are involved).
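To make the gate-level claims above concrete, here is a minimal VHDL sketch (my own illustration, not taken from any of the cited patents) of an adder producing the classic arithmetic flags. Note how Carry and Overflow fall out of the addition almost for free, while Zero needs a separate wide NOR across the whole result:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_with_flags is
    port (
        a, b : in  unsigned(31 downto 0);
        sum  : out unsigned(31 downto 0);
        cf   : out std_logic;   -- Carry: carry out of the top bit
        ovf  : out std_logic;   -- Overflow: carry into the top bit XOR carry out of it
        sf   : out std_logic;   -- Sign: simply the top bit of the result
        zf   : out std_logic    -- Zero: a wide NOR across the whole result
    );
end entity;

architecture rtl of add_with_flags is
    signal wide : unsigned(32 downto 0);
begin
    wide <= ('0' & a) + ('0' & b);   -- 33-bit add so the carry out is visible
    sum  <= wide(31 downto 0);
    cf   <= wide(32);
    -- the carry into bit 31 is recoverable as a(31) xor b(31) xor sum(31)
    ovf  <= a(31) xor b(31) xor wide(31) xor wide(32);
    sf   <= wide(31);
    zf   <= '1' when wide(31 downto 0) = 0 else '0';
end architecture;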
Overall I would conclude that the existence of arithmetic flags costs significant engineering resources to prevent them from having a significant performance impact, but that effort is also effective, so a significant impact is avoided.

How are Opcodes & Operands Defined in a CPU?

Extensive searching has sent me in a loop over the course of 3 days, so I'm depending on you guys to help me catch a break.
Why exactly does one 8-bit sequence of highs and lows perform this action, while another 8-bit sequence performs that action?
My intuition tells me that the CPU's circuitry hard-wires one binary sequence to do one thing, and another to do another thing. That would mean different processors with potentially different chip circuitry wouldn't define one particular binary sequence as the same action as another?
Is this why we have assembly? I need someone to confirm and/or correct my hypothesis!
Opcodes are not always 8 bits, but yes, it is hardcoded/wired in the logic to isolate the opcode and then send you down a course of action based on that. Think about how you would do it in an instruction set simulator; why would logic be any different? Logic is simpler than software languages, there is no magic there. ONE, ZERO, AND, OR, NOT: that's as complicated as it gets.
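To make that concrete, here is a toy VHDL decoder; the opcode encodings are completely made up, which is exactly the point - a different designer could pick different patterns and the machine would still work, it would just speak a different instruction set:

library ieee;
use ieee.std_logic_1164.all;

entity toy_decoder is
    port (
        opcode : in  std_logic_vector(3 downto 0);
        alu_op : out std_logic_vector(1 downto 0)
    );
end entity;

architecture rtl of toy_decoder is
begin
    process (opcode)
    begin
        case opcode is
            when "0000" => alu_op <= "00";  -- "add", only because we wired it so
            when "0001" => alu_op <= "01";  -- "subtract"
            when "0010" => alu_op <= "10";  -- "and"
            when others => alu_op <= "11";  -- unused encodings do nothing useful here
        end case;
    end process;
end architecture;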
Along the same lines: if I were given an instruction set document and you were given an instruction set document and we were told to create a processor or write an instruction set simulator, would we produce the exact same code, even if the variable names were different? No. Ideally we would have programs that are functionally the same: they both parse the instruction and execute it. Logic is no different; give the spec to two engineers and you might get two different processors that are functionally the same, though one might perform better, etc. Look at the long-running processor families, x86 in particular: they re-invent it every couple of years, staying instruction-set compatible for the legacy instructions while sometimes adding new instructions. Same for ARM and others.
And there are different instruction sets: ARM is different from x86, which is different from MIPS. The opcodes and/or the bits you examine in the instruction vary; for none of these can you simply look at 8 bits. In each case you examine some bits, and if that is not enough to uniquely identify the instruction/operation you examine some more; where those bits are and what the rules are is very specific to each architecture. Otherwise, what would be the point of having different names for them if they were the same?
And this information was out there; you just didn't look in the right places. There are countless open online courses on the topic, books that Google should hit some pages of, as well as open source processor cores you can look at, and countless instruction set simulators with source code.

Are muxes more "expensive" than other logic?

This is mostly out of curiosity.
One fragment from some VHDL code that I've been working on recently resembles the following:
led_q <= (pwm_d and ch_ena) when pwm_ena = '1' else ch_ena;
This is a mux-style expression, of course. But it's also equivalent to the following basic logic expression (at least when ignoring non-binary states):
led_q <= ch_ena and (pwm_d or not pwm_ena);
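Checking all eight input combinations confirms the two forms match:

pwm_ena  pwm_d  ch_ena | mux form | Boolean form
   0       0      0    |    0     |      0
   0       0      1    |    1     |      1
   0       1      0    |    0     |      0
   0       1      1    |    1     |      1
   1       0      0    |    0     |      0
   1       0      1    |    0     |      0
   1       1      0    |    0     |      0
   1       1      1    |    1     |      1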
Is one "better" than the other in terms of logic utilisation or efficiency when actually implemented in an FPGA? Is it preferable to use one over the other, or is the compiler smart enough to pick the "best" on its own?
(For the curious, the purpose of the expression is to define the state of an LED: if ch_ena is false it should always be off, as the channel is disabled; otherwise it should either be on solidly or flashing according to pwm_d, depending on pwm_ena (PWM enable). I think the first form describes this more obviously than the second, although it's not too hard to work out how the second behaves.)
For a simple logical expression, like the one shown, where the synthesis tool can easily create a complete truth table, the expression is likely to be converted to an internal truth table, which is then directly mapped to the available FPGA LUT resources. Since the truth table is identical for the two equivalent expressions, the hardware will also be the same.
However, for complex expressions where a complete truth table can't be generated, e.g. when using arithmetic operations, and/or where dedicated resources are available, the synthesis tool may choose to hold an internal representation that is more closely related to the original VHDL code, and in this case the VHDL coding style can have a great impact on the resulting logic, even for equivalent expressions.
In the end, the implementation is tool specific, so the best way to find out what logic is generated is to try it with the specific tool, especially for large or timing-critical parts of the design where the implementation is critical.
In general it depends on the target architecture. For Xilinx FPGAs the logic is mostly mapped into LUTs with sporadic use of the hard logic resources where the mapper can make use of them. Every possible LUT configuration has essentially equal performance so there's little benefit to scrutinizing the mapper's work unless you're really pushing the speed limits of the device where you'd be forced into manually instantiating hand-mapped LUTs.
Non-LUT based architectures like the Actel/Microsemi device families use 2-input muxes as the main logic primitive and everything is mapped down to them. You can't generalize what is best across all types of FPGAs and CPLDs but nowadays you can mostly trust that the mapper will do a decent enough job using timing constraints to push it toward the results you need.
With regard to the question, I think it is best to avoid obscure Boolean expressions where possible. They tend to be hard to decipher months later when you've forgotten what you meant them to do. I would lean toward the when-else simply from a code-maintenance point of view. Even for this trivial example you have to think closely about what behavior the Boolean form describes, whereas the when-else describes the intended behavior directly in human-level syntax.
HDLs work best when you use the highest abstraction possible and avoid wallowing around with low-level bit twiddling. This is a place where VHDL truly shines if you leverage the more advanced features of the language and move away from describing raw logic everywhere. Let the synthesizer do the work. Introductory learning materials focus on the low level structural gate descriptions and logic expressions because that is easiest for beginners to get a start on but it is not the best way to use VHDL for complex designs in the long run.
Of course there are situations where Boolean expressions are better, particularly when doing bitwise operations across whole vectors in parallel, which would otherwise require messy loops. It all depends on the context.

How expensive is data type conversion vs. bit array manipulation in VHDL?

In VHDL, if you want to increment a std_logic_vector that represents a number by one, I have come across a few options.
1) Use type conversion functions to change the std_logic_vector to a signed or unsigned value, then convert it to an integer, add one to that integer, and convert it back to a std_logic_vector the opposite way. (The usual conversion chart between std_logic_vector, signed/unsigned and integer is handy when working out these steps.)
2) Check to see the value of the LSB. If it is a '0', make it a '1'. If it is a '1', do a "shift left" and concatenate a '0' to the LSB. Ex: (For a 16 bit vector) vector(15 downto 1) & '0';
In an FPGA, as compared to a microprocessor, physical hardware resources seem to be the limiting factor instead of actual processing time. There is always the risk that you could run out of physical gates.
So my real question is this: which one of these implementations is "more expensive" in an FPGA and why? Are the compilers robust enough to implement the same physical representation?
None of the type conversions cost anything in hardware.
The different types are purely about expressing the design as clearly as possible - not only to other readers (or yourself, next year :-) but also to the compiler, letting it catch as many errors as possible (such as: this integer value is out of range).
Type conversions are your way of telling the compiler "yes, I meant to do that".
Use the type that best expresses the design intent.
If you're using too many type conversions, that usually means something has been declared as the wrong type; stop and think about the design for a bit and it will often simplify nicely. If you want to increment a std_logic_vector, it should probably be an unsigned, or even a natural.
Then convert when you have to: often at top-level ports or other people's IP.
Conversions may infinitesimally slow down simulations, but that's another matter.
As for your option 2: low-level detailed descriptions are not only harder to understand than a <= a + 1; they are no easier for synthesis tools to translate, and they are more likely to contain bugs.
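To make that concrete, here is a minimal sketch along those lines (names invented): keep the value as unsigned internally and convert only at the std_logic_vector boundary. The conversion on the last line is purely a change of type and synthesises to plain wires:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity counter is
    port (
        clk   : in  std_logic;
        q_slv : out std_logic_vector(15 downto 0)  -- e.g. other people's IP wants slv
    );
end entity;

architecture rtl of counter is
    signal count : unsigned(15 downto 0) := (others => '0');  -- the natural type for counting
begin
    process (clk)
    begin
        if rising_edge(clk) then
            count <= count + 1;   -- the entire "increment algorithm"
        end if;
    end process;
    q_slv <= std_logic_vector(count);  -- zero-cost conversion at the boundary
end architecture;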
I am giving another answer to better explain why, in terms of gates and FPGA resources, it really doesn't matter which method you use. In the end, the logic will be implemented in look-up tables and flip-flops. Usually (or always?) there are no native counters in the FPGA fabric; the synthesis will turn your code into LUTs, period. I always recommend trying to express the code as simply as possible. The more you try to write your code as low-level RTL (vs. behavioral), the more error-prone it will be. KISS is the appropriate course of action every time. The synthesis tool, if any good, will simplify your intent as much as possible.
The only reason to implement arithmetic by hand is if you:
Think you can do a better job than the synthesis tool (where better could be smaller, faster, less power consuming, etc)
and you think the reduced portability and maintainability of your code does not matter too much in the long run
and it actually matters if you do a better job than the synthesis tool (e.g. you can reach your desired operating frequency only by doing this by hand rather than letting the synthesis tool do it for you).
In many cases you can also rewrite your RTL code slightly or use synthesis attributes such as KEEP to persuade the synthesis tool to make more optimal implementation choices rather than hand-implementing arithmetic components.
By the way, a fairly standard trick to reduce the cost of hardware counters is to avoid normal binary arithmetic and instead use, for example, LFSR counters. See Xilinx XAPP 052 for some inspiration in this area if you are interested in FPGAs (it is quite old, but the general principles are the same in current FPGAs).
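For the curious, here is a minimal sketch of a 4-bit XNOR-feedback LFSR in the spirit of that app note (the tap positions below give a maximal 15-state sequence; consult XAPP 052 for the real tap tables at other widths):

library ieee;
use ieee.std_logic_1164.all;

entity lfsr4 is
    port (
        clk : in  std_logic;
        q   : out std_logic_vector(3 downto 0)
    );
end entity;

architecture rtl of lfsr4 is
    -- with XNOR feedback the all-zeros state is valid, so a plain reset works;
    -- the all-ones state is the lock-up state that must be avoided
    signal r : std_logic_vector(3 downto 0) := "0000";
begin
    process (clk)
    begin
        if rising_edge(clk) then
            r <= r(2 downto 0) & (r(3) xnor r(2));  -- taps at bits 3 and 2
        end if;
    end process;
    q <= r;
end architecture;

Note that the states do not appear in binary order, so this trick only helps where you need a terminal count or a divide-by-N, not the numeric value itself.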

Large Scale VHDL modularization techniques

I'm thinking about implementing a 16-bit CPU in VHDL.
A simple-ish CPU.
ADD, MULS, NEG, bit shift, JUMP, relative jump, BREQ, relative BREQ... something along those lines.
Probably all only working with 16-bit operands.
I might even cut it down and use only a single operand and an accumulator.
With some status registers: Carry, Zero, Negative (unless I use an accumulator).
I know how to design all the parts from logic gates, and plan to build them up from first principles.
So for my ALU I'll need to 'build' an adder, probably a carry-lookahead group adder;
this adder itself is made up of a couple of parts, which are themselves made up of a couple of parts.
Anyway, my problem is not the CPU design, or the VHDL (I know the language, more or less).
It's how I should keep things organised.
How should I use packages?
How should I name my processes and port maps? (I've never seen the benefit of naming the port maps, or processes.)
Whatever you do, be sure to read Jiri Gaisler's masterwork on the structured VHDL design method.
http://www.gaisler.com/doc/vhdl2proc.pdf
http://www.gaisler.com/doc/structdes.pdf
You'll be very glad you did.
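To give a flavour of what Gaisler advocates, here is a minimal, hedged skeleton of his two-process style; the entity and record contents are invented for illustration:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity my_block is
    port (
        clk  : in  std_logic;
        done : out std_logic
    );
end entity;

architecture two_proc of my_block is
    type reg_type is record          -- all state gathered into one record
        count : unsigned(15 downto 0);
        done  : std_logic;
    end record;
    signal r, rin : reg_type;
begin
    comb : process (r)               -- all the logic, in one combinational process
        variable v : reg_type;
    begin
        v := r;                      -- default: keep current state
        v.count := r.count + 1;      -- the actual algorithm goes here
        v.done  := '0';
        if r.count = 65535 then
            v.done := '1';
        end if;
        rin <= v;                    -- drive the next-state signal...
    end process;

    regs : process (clk)             -- ...into one trivial clocked process
    begin
        if rising_edge(clk) then
            r <= rin;
        end if;
    end process;

    done <= r.done;
end architecture;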
Looking at some existing examples wouldn't hurt. At the level you're talking about (naming conventions and such) I've never really done much different in hardware design than in software.
As an aside, I'd generally advise against doing things like your own adders and such, unless it's required because it's homework, or something like that. With FPGAs and (to a slightly lesser extent) ASICs, you have an existing "library" of hardware in the device, so something like A <= B + C will typically use an adder circuit that's already built into the device in the case of an FPGA, or a hand-optimized hard macro in the case of an ASIC.
Writing your own will take a fair amount of extra work, and it'll almost always produce a worse result. In the case of an ASIC, it'll be a little worse; in the case of an FPGA, it'll usually be quite a bit worse.
Edit: I should also note that a simple CPU doesn't really qualify as a large-scale design, at least IMO. Maybe it's due to my background in software, but I've always found CPU design fairly straightforward. Just for one example, the one time I did a DRAM controller, it seemed like a lot more work to me. I don't recall anything like source code line counts, but based on memory, I'd say it was larger (probably by something like 2x). Of course, it'll depend on exactly how simple of a CPU you decide on too...
