How are shifts implemented on the hardware level?

How are bit shifts implemented at the hardware level when the number to shift by is unknown?
I can't imagine that there would be a separate circuit for each number you can shift by (that would be 64 shift circuits on a 64-bit machine), nor can I imagine that it would be a loop of shifts by one (that would take up to 64 shift cycles on a 64-bit machine). Is it some sort of compromise between the two, or is there some clever trick?

The circuit is called a "barrel shifter" - it's basically a load of multiplexers. It has one layer of muxes per bit of the shift amount, so an 8-bit barrel shifter needs three bits to say "how much to shift by" and hence 3 layers of muxes.
There's a picture of an 8-bit one at http://www.globalspec.com/reference/55806/203279/chapter-9-additional-circuit-designs.
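As a rough software model of that structure (just a sketch; the function name is mine, and real hardware evaluates all three mux layers combinationally rather than in sequence):

    #include <stdint.h>

    /* Sketch of an 8-bit barrel shifter (left shift).
       Each line models one layer of 2:1 muxes, selected by one bit of amt:
       layer 0 shifts by 1, layer 1 by 2, layer 2 by 4.
       In hardware the three layers are just wiring plus muxes. */
    static uint8_t barrel_shl8(uint8_t x, unsigned amt)
    {
        uint8_t y = x;
        y = (amt & 1) ? (uint8_t)(y << 1) : y;  /* layer 0: shift by 1 or pass through */
        y = (amt & 2) ? (uint8_t)(y << 2) : y;  /* layer 1: shift by 2 or pass through */
        y = (amt & 4) ? (uint8_t)(y << 4) : y;  /* layer 2: shift by 4 or pass through */
        return y;
    }

Any shift amount from 0 to 7 passes through exactly three layers, so the delay is the same regardless of how far you shift.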

Related

Which one is more complex: calculating a 64-bit CRC or two 32-bit CRCs with different polynomials?

I was wondering how a 64-bit CRC on an FPGA compares to two 32-bit CRCs (with different polynomials) on the same FPGA. Would two 32-bit CRCs be more complicated than performing a single 64-bit CRC? Is it going to take a while, or would it be fast?
How can I calculate the complexity (or do a complexity analysis)?
Any help would be much appreciated. Thank you.
I was wondering how 64-bit CRC on an FPGA compares to two 32-bit CRCs (different polynomials) on the same FPGA.
On a "normal" FPGA it does not matter which kind of information (CRCs, checksums, floating-point values ...) you compare:
Checking if a 64-bit value equals another 64-bit value takes the same amount of resources (gates or time).
This is of course not true if you use an FPGA that has a built-in CRC unit that (as an example) supports CRC32 but not CRC64 ...
Would two 32-bit CRCs be more complicated than performing a single 64-bit CRC?
In both cases you'll need 64 logic cells (this means: 64 LUTs and 64 flip-flops).
In the case of a 64-bit CRC, 63 logic cells must be connected to the previous logic cell and there must be one signal line connecting the first and the last logic cell.
In the case of two 32-bit CRCs, 62 logic cells must be connected to the previous logic cell and there must be two signal lines connecting the first and the last logic cell of each CRC.
If you have an FPGA that allows connecting 64 cells in a row without using a "long" signal line, the 64-bit CRC saves one "long" signal line.
(Edit: On the FPGA on my eval board you can connect 16 cells in a row; on such an FPGA, both one 64-bit CRC and two 32-bit CRCs would cost 5 "long" signal lines.)
Is it going to take a while, or would it be fast?
How can I calculate the complexity (or do a complexity analysis)?
You require one clock cycle per bit - in both cases.
Note that an FPGA works completely differently from a computer:
You typically don't execute operations one after another; all the operations are performed at the same time, in parallel, every clock cycle.
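For reference, here is a plain C model of the bit-serial shift register behaviour described above, where one inner-loop iteration corresponds to one clock cycle. The reflected CRC-32 polynomial 0xEDB88320 is only an example, not necessarily what the asker would use:

    #include <stddef.h>
    #include <stdint.h>

    /* Software model of a bit-serial CRC: one inner iteration = one clock.
       Polynomial and bit order (reflected CRC-32) are just an example. */
    static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++)                 /* one "clock" per bit */
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }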

Is the time cost of integer multiplication the same as any binary operation on ARM or Intel processors?

Is the processing time of an integer multiplication the same as that of any other integer binary operation on a modern pipelined CPU (e.g. Intel, ARM)?
In Intel's assembly documentation, it is said that an integer multiplication takes 1 cycle, like any other integer binary operation. Does that one cycle correspond to the actual duration, assuming the operations are pipelined?
There is more than the cycle count to consider:
latency
pipeline
While the results of simple ALU instructions are available right away, multiply instructions have to go through the MAC (multiply-accumulate) unit, which usually costs more cycles and comes with a latency of several cycles.
And often there is only one MAC unit, which means the core cannot dual-issue two mul instructions.
Example: ARMv5E:
smulxy (16-bit): one cycle, plus three cycles of latency
mul (32-bit): two cycles, plus three cycles of latency
umull (64-bit): three cycles, plus four (lower half) and five (upper half) cycles of latency
No, multiply is much more complicated than XOR, ADD, OR, NOT, etc. While binary makes it much easier than base 10, you still need a larger adder (larger than just a two-operand ADD or other simple operation).
Take the bits abcd
abcd
* 1011
========
abcd
abcd.
0000..
+abcd...
=========
In base 10, as in grade school, you have to multiply digit by digit; here you are still multiplying, but only by one or zero, so each row is either a shifted copy of the first operand or a row of zeros. The additions, though, get big and are cascaded. Look up the full adder (on Wikipedia, for example): a single-column adder for a simple two-operand add has three inputs and two outputs, and the carry out of one bit column is the carry in of the next. No logic is instantaneous; even a single transistor inversion (NOT) takes a non-zero amount of time. Now think about how many gates are lined up just to make one 32-bit two-operand ADD, and then think about a 32-bit multiply, where 32 partial products (plus carry bits) all have to be summed, all of it cascaded. The chip real estate and the settling time grow almost exponentially for multiply, and you start to worry about whether you can meet timing (can the most significant bit of the result settle within the designed clock period?).
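The partial-product diagram maps directly onto a shift-and-add loop; the following is only a software sketch of that structure (the function name is made up), since a real multiplier forms all the partial products at once and sums them with cascaded adders rather than looping:

    #include <stdint.h>

    /* Shift-and-add multiply, mirroring the partial-product rows above:
       for each set bit of b, add a shifted copy of a; for each clear bit,
       the row is all zeros and contributes nothing. */
    static uint64_t mul_shift_add(uint32_t a, uint32_t b)
    {
        uint64_t result = 0;
        for (int i = 0; i < 32; i++)
            if (b & (1u << i))
                result += (uint64_t)a << i;   /* partial product for bit i */
        return result;
    }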
So you will see optimizations, including pipelining the multiplier itself: not 32 clocks for a 32-bit multiply, but perhaps not one clock either; maybe two or four. With a pipeline a dozen stages deep, though, you can bury that and still meet an advertised average of one clock per instruction.
On Intel, ARM, etc., the 1-cycle figure is an illusion: the math operation itself might take that long, but the execution of the instruction takes a few to a handful of cycles, and pipeline depths may run from several stages to a dozen or more. There is limited use in attempting to count cycles these days. Feeding the pipe and handling memory operations tend to dominate performance, not the pipe or the instructions themselves, outside a carefully crafted simulation of the core.
For the Cortex-M cores, which are perhaps not what you are asking about but are very much part of daily life, the documentation shows that the chip vendor can choose either the larger, faster multiplier or the smaller, slower one, trading overall chip size against performance. (I do not examine the Cortex-A docs as much, as I do not use them as often.) It is a compile-time option when the core is synthesized, and there are many such options; this is why, for any ARM core (Cortex-M or Cortex-A), you cannot compare, say, two Cortex-M4s from different vendors or chip families within a vendor, as they could have been configured differently and behave/perform differently (they still execute the enabled instructions in the same functional way, of course).
So no, you cannot assume the "execution time" or "cycle time" of ANY instruction, and in particular multiply, divide, and anything floating-point cannot be assumed to be single cycle. Like all the other instructions, the advertised one cycle relies on pipeline effects; no instruction takes one cycle start to finish, and depending on the pipe depth of the design, multiply and divide may take more than one clock yet be hidden by the pipe so the average still works out to one clock per instruction.
Note that this question is "too broad", as there are many Intel and ARM implementations past and present. Chip implementation details are often not available or are protected by NDA; all you have, if anything, are public documents that can hide the reality.

Why is the register length static in any CPU

Why is the register length (in bits) that a CPU operates on not dynamically/manually/arbitrarily adjustable? Would it make the computer slower if it was adjustable this way?
Imagine you had an 8-bit integer. If you could adjust the CPU register length to 8 bits, the CPU would only have to go through the first 8 bits instead of extending the 8-bit integer to 64 bits and then going through all 64 bits.
At first I thought you were asking if it was possible to have a CPU with no definite register size. That makes no sense, since the number and size of the registers are a physical property of the hardware and cannot be changed.
However, some architectures let the programmer work on a smaller part of a register or pair registers together.
The x86 does both for example, with add al, 9 (uses only 8 bits of the 64-bit rax) and div rbx (pairs rdx:rax to form a 128-bit register).
The reason this scheme is not more widespread is that it comes with a lot of trade-offs.
More registers means more bits needed to address them, simply put: longer instructions.
Longer instructions mean lower code density, more complex decoders, and lower performance.
Furthermore, most elementary operations, like the logical ones, addition, and subtraction, are already implemented to operate on a full register in a single cycle.
Finally, one execution unit can handle only one instruction at a time; we cannot issue eight 8-bit additions to a 64-bit ALU at the same time.
So there wouldn't be any improvement, neither in latency nor in throughput.
Accessing partial registers is useful for the programmer to fan out the number of available registers: for example, if an algorithm works with 16-bit data, the programmer could use a single physical 64-bit register to store four items and operate on them independently (but not in parallel), as in the sketch below.
ISAs with variable-length instructions can also benefit from using partial registers, because that usually means smaller immediate values: for example, an instruction that sets a register to a specific value usually has an immediate operand that matches the size of the register being loaded (though RISC ISAs usually sign-extend or zero-extend it).
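As a rough illustration of that fan-out idea in C (names and layout are mine, just a sketch): four 16-bit items packed into one 64-bit value, extracted and updated one at a time.

    #include <stdint.h>

    /* Four 16-bit items in one 64-bit value, handled independently
       (one at a time, not in parallel). */
    static uint16_t get_item(uint64_t packed, int i)
    {
        return (uint16_t)(packed >> (16 * i));
    }

    static uint64_t set_item(uint64_t packed, int i, uint16_t v)
    {
        packed &= ~((uint64_t)0xFFFF << (16 * i));   /* clear the old item */
        packed |=  ((uint64_t)v      << (16 * i));   /* insert the new one */
        return packed;
    }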
Architectures like ARM (and presumably others as well) support half-precision floats. The idea is to do what you were speculating about and what @Margaret explained. With half-precision floats, you can pack two float values into a single register, using less bandwidth at the cost of reduced accuracy.

256 bit fixed point arithmetic, the future?

Just some silly musings, but if computers were able to efficiently calculate 256 bit arithmetic, say if they had a 256 bit architecture, I reckon we'd be able to do away with floating point. I also wonder, if there'd be any reason to progress past 256 bit architecture? My basis for this is rather flimsy, but I'm confident that you'll put me straight if I'm wrong ;) Here's my thinking:
You could have a 256-bit type that used 127 or 128 bits for the integer part, 127 or 128 bits for the fractional part, and then of course a sign bit. If you had hardware that was capable of calculating, storing and moving such big numbers with no problems, I reckon you'd be set to handle any calculation you'd come across.
One example: if you were working with lengths and represented all values in meters, then the minimum value (2^-128 m) would be smaller than the Planck length, and the biggest value (2^127 m) would be bigger than the diameter of the observable universe. Imagine calculating light-years of distance with a precision finer than a Planck length!
Ok, that's only one example, but I'm struggling to think of any situations that could possibly warrant bigger and smaller numbers than that. Any thoughts? Are there possible problems with fixed point arithmetic that I haven't considered? Are there issues with creating a 256 bit architecture?
SIMD will make narrow types valuable forever. If you can do a 256bit add, you can do eight 32bit integer adds in parallel on the same hardware (by not propagating carry across element boundaries). Or you can do thirty-two 8bit adds.
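A small SWAR sketch of that idea on ordinary 64-bit hardware: eight 8-bit adds done with one 64-bit add, keeping carries from crossing element boundaries (a SIMD adder simply cuts the carry chain; in plain C we have to mask it off):

    #include <stdint.h>

    /* Eight independent 8-bit adds (mod 256 per lane) in one 64-bit add,
       by preventing carries from propagating across byte boundaries. */
    static uint64_t add_8x8(uint64_t a, uint64_t b)
    {
        const uint64_t H = 0x8080808080808080ull;   /* top bit of each byte */
        uint64_t low = (a & ~H) + (b & ~H);          /* add without the top bits */
        return low ^ ((a ^ b) & H);                  /* restore the top bits' sum */
    }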
Hardware multiplier circuits are a lot more expensive to make wider, so it's not safe to assume that a 256b x 256b multiplier will be practical to build.
Even besides SIMD considerations, memory bandwidth / cache footprint is a huge deal.
So 4B float will continue to be excellent for being precise enough to be useful, but small enough to pack many elements into a big vector, or in cache.
Floating-point also allows a much wider range of numbers by using some of its bits as an exponent. With mantissa = 1.0, the range of IEEE binary64 double goes from 2^-1022 to 2^1023 for "normal" numbers (53-bit mantissa precision over the whole range, only getting worse for denormals (gradual underflow)). Your proposal only handles numbers from about 2^-127 (with 1 bit of precision) to 2^127 (with 256b of precision).
Floating point has the same number of significant figures at any magnitude (until you get into denormals very close to zero), because the mantissa is fixed width. Normally this is a useful property, especially when multiplying or dividing. See Fixed Point Cholesky Algorithm Advantages for an example of why FP is good. (Subtracting two nearby numbers is a problem, though...)
Even though current SIMD instruction sets already have 256b vectors, the widest element width is 64b for add. AVX2's widest multiply is 32bit * 32bit => 64bit.
AVX512DQ has a 64b * 64b -> 64b (low half) vpmullq, which may show up in Skylake-E (Purley Xeon).
AVX512IFMA introduces a 52b * 52b + 64b => 64bit integer FMA. (VPMADD52LUQ low half and VPMADD52HUQ high half.) The 52 bits input precision is clearly so they can use the FP mantissa multiplier hardware, instead of requiring separate 64bit integer multipliers. (A full vector width of 64bit full-multipliers would be even more expensive than vpmullq. A compromise design like this even for 64bit integers should be a big hint that wide multipliers are expensive). Note that this isn't part of baseline AVX512F either, and may show up in Cannonlake, based on a Clang git commit.
Supporting arbitrary-precision adds/multiplies in SIMD (for crypto applications like RSA) is possible if the instruction set is designed for it (which Intel SSE/AVX isn't). Discussion on Agner Fog's recent proposal for a new ISA included an idea for SIMD add-with-carry.
For actually implementing 256b math on 32 or 64-bit hardware, see https://locklessinc.com/articles/256bit_arithmetic/ and https://gmplib.org/. It's really not that bad considering how rarely it's needed.
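A minimal sketch of how such a 256-bit add looks on 64-bit hardware (four 64-bit limbs, least significant first, with the carry propagated by hand; this mirrors the general approach in the linked articles but is not taken from them):

    #include <stdint.h>

    /* r = a + b for 256-bit numbers stored as four 64-bit limbs,
       least significant limb first; the carry is carried along manually. */
    static void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
    {
        uint64_t carry = 0;
        for (int i = 0; i < 4; i++) {
            uint64_t s  = a[i] + b[i];
            uint64_t c1 = (s < a[i]);      /* carry out of a[i] + b[i] */
            r[i] = s + carry;
            uint64_t c2 = (r[i] < s);      /* carry out of adding the old carry */
            carry = c1 | c2;               /* the two can never both be 1 */
        }
    }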
Another big downside to building hardware with very wide integer registers is that even if the upper bits are usually unused, out-of-order execution hardware needs to be able to handle the case where they are used. This means a much larger physical register file compared to an architecture with 64-bit registers (which is bad, because it needs to be very fast and physically close to other parts of the CPU, and have many read ports). e.g. Intel Haswell has 168-entry PRFs for integer and FP/SIMD.
The FP register file already has 256b registers, so I guess if you were going to do something like this, you'd do it with execution units that used the SIMD vector registers as inputs/outputs, not by widening the integer registers. But the FP/SIMD execution units aren't normally connected to the integer carry flag, so you might need a separate SIMD-carry register for 256b add.
Intel or AMD already could have implemented an instruction / execution unit for adding 128b or 256b integers in xmm or ymm registers, but they haven't. (The max SIMD element width even for addition is 64-bit. Only shuffles operate on the whole register as a unit, and then only with byte-granularity or wider.)
128-bit computers. It is also about addressing memory, and about when we run out of 64 bits for addressing it. Currently there are servers with 4 TB of memory, which requires about 42 address bits (2^42 > 4 x 10^12). If we assume that memory prices halve every second year, then we need one more bit every second year. We still have 22 bits left, so at least 2 * 22 years; and since memory prices are likely not dropping that fast, it will be more than 50 years before we run out of 64-bit addressing capability.
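A back-of-the-envelope version of that estimate (just the arithmetic from the paragraph above, assuming we start at roughly 2^42 bytes and add one address bit every two years):

    #include <stdio.h>

    int main(void)
    {
        int current_bits  = 42;   /* ~4 TB of installed memory today */
        int years_per_bit = 2;    /* "memory doubles every second year" */
        int years = (64 - current_bits) * years_per_bit;
        printf("~%d years until 64-bit addressing runs out\n", years);  /* prints ~44 */
        return 0;
    }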

How do bits become a byte? [closed]

Possibly the most basic computer question of all time, but I have not been able to find a straightforward answer and it is driving me crazy. When a computer 'reads' a byte, does it read it as a sequential series of ones and zeros one after the other, or does it somehow read all 8 ones and zeros at once?
A computer system reads the data in both ways, depending on the type of operation and how the digital system is designed. I'll explain this with a very simple example: a full adder circuit.
A full adder adds binary numbers and accounts for values carried in as well as out (Wikipedia)
Example of Parallel operation
Suppose that in some task we need to add two 8-bit (1-byte) numbers, and all the bits are available at the time of addition.
In that case we can design a digital system with 8 full adders (one for each bit).
Example of Serial Operation
In some other task, all 8 bits will not be available simultaneously.
Or you may decide that having 8 separate adders is too costly, since you also need to implement other mathematical operations (like subtraction, multiplication and division). So instead of 8 separate units you have 1 unit that processes the bits individually. In this scenario you need three storage units (shift registers): two to hold the two 8-bit operands and one to hold the result. On each clock pulse, a single bit is shifted out of each operand register into the full adder, which performs the addition and shifts the 1-bit result into the result register.
The linked figure contains some additional detail that is not needed for this thread, but study digital logic design and computer architecture if you want to go deeper into this.
(Linked references: shift register; shift register operations demo.)
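A C model of both schemes, as a sketch (the full adder is standard; the serial loop stands in for the shift registers and clock pulses described above, and the parallel version would simply instantiate eight adders side by side):

    #include <stdint.h>

    /* One-bit full adder: sum and carry-out from two input bits and a carry-in. */
    static void full_adder(int a, int b, int cin, int *sum, int *cout)
    {
        *sum  = a ^ b ^ cin;
        *cout = (a & b) | (a & cin) | (b & cin);
    }

    /* Serial operation: one full adder reused for 8 clock pulses.
       Each pulse shifts one bit out of each operand and one result bit in. */
    static uint8_t add8_serial(uint8_t x, uint8_t y)
    {
        uint8_t result = 0;
        int carry = 0;
        for (int i = 0; i < 8; i++) {               /* one clock pulse per bit */
            int s;
            full_adder((x >> i) & 1, (y >> i) & 1, carry, &s, &carry);
            result |= (uint8_t)(s << i);
        }
        return result;
    }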
This is really kind of outside the scope of Stackoverflow, but it brings back such fond memories from college.
It depends. Sometimes a computer reads bits one at a time. For example, over older Ethernet, Manchester code is used. However, over old parallel printer cables there were 8 pins, each one signaling a bit, and an entire octet (byte) was sent at once.
In serial (one-bit-at-a-time) encodings, you're typically measuring transitions in the line or transitions against some well-defined clock source.
In parallel encodings, you're typically reading all the bits into a register at a time and latching the register.
Look up flipflops, registers, and logic gates for information on the low-level parts of this.
Bits are transmitted one at a time in serial transmission, and multiple bits at a time in parallel transmission. A bitwise operation optionally processes bits one at a time. Data transfer rates are usually measured in decimal SI multiples of the unit bit per second (bit/s), such as kbit/s.
Wikipedia's article on Bit
The processor works with a defined register length: 8, 16, 32, 64 ... Think of a register as a set of connections, one per bit; that is the number of bits that will be processed at once in one processor core, one register at a time. The processor has different kinds of registers, for example the internal instruction register or the programmer-visible data and address registers.
Think of it this way, at least at a physical level: In a transmission cable from point A to B (A and B can be anything, hard drive, CPU, RAM, USB, etc.) each wire in that cable can transmit one bit at a time. Both A and B have a clock pulsing at the same rate. On each pulse, the sender changes the amount of power going down each wire to signify the value of the new bit(s). So, the # of wires in the cable = the # of bits that can be transmitted each "pulse". (Note: This is a very simplified and theoretical explanation).
At a software level, in the CPU, you can never address anything smaller than a byte. You can "access" and manipulate specific bits within a byte by using the bitwise operators (& (AND), | (OR), << (Left Shift), >> (Right Shift), ^ (XOR)); see the sketch after this answer.
In hardware, the number of bits being sent each pulse is completely dependent of the hardware itself.
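A small sketch of the bit manipulation mentioned above (the helper names are made up): reading and writing individual bits of a byte with those operators, since nothing smaller than a byte is directly addressable.

    #include <stdint.h>

    /* Accessing individual bits of a byte with bitwise operators. */
    static int     get_bit(uint8_t byte, int n)    { return (byte >> n) & 1; }
    static uint8_t set_bit(uint8_t byte, int n)    { return (uint8_t)(byte |  (1u << n)); }
    static uint8_t clear_bit(uint8_t byte, int n)  { return (uint8_t)(byte & ~(1u << n)); }
    static uint8_t toggle_bit(uint8_t byte, int n) { return (uint8_t)(byte ^  (1u << n)); }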
