Which is more complex: calculating a 64-bit CRC or two 32-bit CRCs with different polynomials? - complexity-theory

I was wondering how a 64-bit CRC on an FPGA compares to two 32-bit CRCs (different polynomials) on the same FPGA. Would two 32-bit CRCs be more complicated than a single 64-bit CRC? Is it going to take a while, or would it be fast?
How can I calculate the complexity (or do a complexity analysis)?
Any help would be much appreciated.
Thank you.

I was wondering how a 64-bit CRC on an FPGA compares to two 32-bit CRCs (different polynomials) on the same FPGA.
On a "normal" FPGA it does not matter which kind of information (CRCs, checksums, floating-point values ...) you compare:
Checking if a 64-bit value equals another 64-bit value takes the same amount of resources (gates or time).
This is of course not true if you use an FPGA that has a built-in CRC unit that (as an example) supports CRC32 but not CRC64 ...
Would two 32-bit CRCs be more complicated than a single 64-bit CRC?
In both cases you'll need 64 logic cells (this means: 64 LUTs and 64 flip-flops).
In the case of a 64-bit CRC, 63 logic cells must be connected to the previous logic cell and there must be one signal line connecting the first and the last logic cell.
In the case of two 32-bit CRCs, 62 logic cells must be connected to the previous logic cell and there must be two signal lines connecting the first and the last logic cell of each CRC.
If you have an FPGA that allows connecting 64 cells in a row without using a "long" signal line, the 64-bit CRC saves one "long" signal line.
(Edit: On the FPGA on my eval board you can connect 16 cells in a row; on such an FPGA, both one 64-bit CRC and two 32-bit CRCs would cost 5 "long" signal lines.)
Is it going to take a while, or would it be fast?
How can I calculate the complexity (or do a complexity analysis)?
You require one clock cycle per bit - in both cases.
Note that an FPGA works completely differently from a computer:
You typically don't spend time performing operations one after another; all operations are performed at the same time...
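To make the "one clock cycle per bit" point concrete, here is a minimal C model of a bit-serial (LFSR-style) CRC; this is my own sketch, not code from the answer, and the polynomial is just a placeholder (the standard CRC-32 polynomial, MSB-first). Each iteration of the inner loop corresponds to one clock cycle of the shift register; a 64-bit CRC would use a 64-bit register, and two 32-bit CRCs would simply run two such registers side by side.

#include <stdint.h>

/* One LFSR step = one clock cycle in the FPGA. */
static uint32_t crc_shift_one_bit(uint32_t reg, int data_bit, uint32_t poly)
{
    int feedback = ((reg >> 31) & 1) ^ (data_bit & 1);   /* MSB xor input  */
    reg <<= 1;                                           /* shift the LFSR */
    if (feedback)
        reg ^= poly;                                     /* tap positions  */
    return reg;
}

uint32_t crc_serial(const uint8_t *msg, int nbytes, uint32_t poly)
{
    uint32_t reg = 0;                        /* zero-initialised register  */
    for (int i = 0; i < nbytes; i++)
        for (int b = 7; b >= 0; b--)         /* MSB first, one bit / clock */
            reg = crc_shift_one_bit(reg, (msg[i] >> b) & 1, poly);
    return reg;
}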

Related

Is time cost of integer multiplication the same as any binary operation on ARM or Intel processors?

Is the processing time of an integer multiplication the same as any integer binary operation on a modern CPU with pipelining (e.g. Intel, ARM)?
In Intel's assembly documentation, it is said that an integer multiplication takes 1 cycle, like any integer binary operation. Is this cycle equivalent to the time duration, supposing the operations are pipelined?
There is more to consider than the cycle count:
latency
pipeline
While the results of simple ALU instructions are available immediately, multiply instructions have to go through the MAC (multiply-accumulate) unit, which usually costs more cycles and comes with a latency of multiple cycles.
And often there is only one MAC unit, which means the core cannot dual-issue two mul instructions.
Example: ARMv5E:
smulxy (16-bit): one cycle, plus three cycles of latency
mul (32-bit): two cycles, plus three cycles of latency
umull (64-bit): three cycles, plus four (lower half) and five (upper half) cycles of latency
No, multiply is much more complicated than XOR, ADD, OR, NOT, etc. While binary makes it much easier than base 10, you still need a larger adder (larger than for a simple two-operand ADD or other operation).
Take the bits abcd:

     abcd
   * 1011
=========
     abcd
    abcd.
   0000..
+ abcd...
=========
In base 10, as in grade school, you had to multiply each digit pair; you are still multiplying here, but only by one or zero, so either you copy and shift the first operand or you copy and shift zeros. And it gets very big, because the additions are cascaded. Look up the XOR gate and the full adder on Wikipedia, or just search for them. For a simple two-operand add you have a single-column adder with three inputs and two outputs, but the carry out of one bit is the carry in of the next. No logic is instantaneous; even a single-transistor inversion (NOT) takes a non-zero amount of time.

You can start to think about how many gates are lined up just to make one 32-bit two-operand ADD, and then think about a 32-bit multiply where each adder takes 32 operand bits plus some number of carry bits, and all of that is cascaded. The chip real estate and the settling time grow almost exponentially for multiply, and then you start to worry about whether you can meet timing (can you settle the msbit of the result within the desired/designed clock period).
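A small C sketch of the shift-and-add scheme just described (the function name is mine): each set bit of the multiplier contributes a shifted copy of the multiplicand, exactly like the abcd * 1011 example above. Software does the additions one at a time; hardware generates all the partial products at once and sums them with a cascade (or tree) of adders, which is where the gates and settling time go.

#include <stdint.h>

/* Shift-and-add multiply: for every 1 bit in b we add a shifted copy of a,
 * for every 0 bit we add nothing. */
uint64_t shift_add_mul(uint32_t a, uint32_t b)
{
    uint64_t product = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1)
            product += (uint64_t)a << i;   /* partial product, shifted */
    }
    return product;
}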
So you will see optimizations made, including multiple pipe stages: not 32 clocks to do a 32-bit multiply, but perhaps not one clock either, maybe two or four. With a dozen-stage-deep pipe, though, you can bury that in there and still meet an advertised one clock per instruction on average.
On Intel, ARM, etc., the 1-cycle figure is an illusion: the math operation itself might take that long, but the execution of the instruction takes a few to a handful of cycles, and your pipe depth may be several to a dozen stages or more. There is limited use in attempting to count cycles these days. Feeding the pipe and handling memory operations tend to dominate performance, not the pipe/instructions themselves, outside a carefully crafted simulation of the core.
For the Cortex-Ms, which are perhaps not what you are asking about but are very much part of our daily life, the documentation shows that it is the chip vendor that chooses between the larger, faster multiplier and the smaller, slower one, trading overall chip size against performance. (I do not examine the Cortex-A docs as much, as I do not use them as often.) This is a compile-time option when the vendor compiles the core, and there are many such options, which is why, for any ARM core (Cortex-M or Cortex-A), you cannot compare, say, two Cortex-M4s from different vendors or from different chip families within a vendor: they could have been compiled differently and behave/perform differently (they still execute the enabled instructions in the same functional way, of course).
So no, you cannot assume the "execution time" or "cycle time" of ANY instruction, and in particular instructions like multiply, divide, and anything floating-point cannot be assumed to be single-cycle. Yes, like all the other instructions, the advertised one cycle is based on pipeline effects; no instruction takes one cycle start to finish, and depending on the pipe depth of the design, the multiply and divide may take more than one clock yet be hidden by the pipe so as to still average one clock per instruction.
Note that this question is "too broad", as there are many Intel and ARM implementations, past and present. Chip implementation details are often not available or are protected by NDA; all you have, if anything, are public documents that can hide the reality.

Parallel Crc-32 calculation Ethernet 10GE MAC

I have generated an Ethernet 10GE MAC design in VHDL. Now I am trying to implement CRC. I have a 64-bit parallel CRC-32 generator in VHDL.
Specification:
- Data bus is 64 bits wide
- Control bus is 8 bits wide (it indicates which data bytes are valid)
Issue:
Let's say my incoming packet length is 14 bytes (assuming no padding).
The CRC is calculated for the first 8 bytes in one clock cycle, but when I try to calculate the CRC over the remaining 6 bytes, the results are wrong because of the zeros that get appended.
Is there a way I can generate the CRC for any packet length in bytes using a 64-bit parallel CRC generator?
What I've tried:
I used different parallel CRC generators (8-bit parallel CRC, 16-bit parallel CRC generator, and so on), but that consumes a lot of FPGA resources. I want to conserve resources by using just the 64-bit parallel CRC generator.
Start with a constant 64-bit data word that brings the effective CRC register to all zeros. Then prepend the message with zero bytes, instead of appending them, putting those zeros on the end of the 64-bit word that is processed first. (You did not provide the CRC definition, so this depends on whether the CRC is reflected or not. If the CRC is reflected, then put the zero bytes in the least-significant bit positions. If the CRC is not reflected, then put them in the most-significant bit positions.) Then exclusive-or the result with a 32-bit constant.
So for the example, you would first feed a 64-bit constant to the parallel CRC generator, then feed two zero bytes and six bytes of message in the first word, and then eight message bytes in the second word. Then exclusive-or the result with the 32-bit constant.
For the standard PKZIP CRC, the 64-bit constant is 0x00000000ffffffff, the 32-bit constant is 0x2e448638, and the prepended zero bytes go in the bottom of the 64-bit word.
If you are in control of the implementation of the CRC generator, then you can probably modify it to initialize the effective CRC register to all zeros when you reset the generator, avoiding the need to feed the 64-bit constant.
I can't speak for certain, but if you can pad zeros at the start of your packet instead of at the end, then you should get the right answer. It does depend on the polynomial and the initializer...
See this answer here Best way to generate CRC8/16 when input is odd number of BITS (not byte)? C or Python
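To illustrate why zero bytes at the start are harmless, here is a rough C sketch of the general principle, not of the 64-bit parallel generator itself: with a zero-initialized register and no final XOR, leading zero bytes leave the register at zero, so the result is unchanged. The pre/post constants in the answer above exist precisely to convert the standard CRC (non-zero init, final XOR, reflected) into this "raw" form and back; the model below is a non-reflected raw CRC-32, purely for demonstration.

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Raw (zero-init, no final XOR) bit-serial CRC-32, MSB-first. */
static uint32_t crc_raw(const uint8_t *msg, size_t n)
{
    uint32_t reg = 0;
    for (size_t i = 0; i < n; i++) {
        reg ^= (uint32_t)msg[i] << 24;
        for (int b = 0; b < 8; b++)
            reg = (reg & 0x80000000u) ? (reg << 1) ^ 0x04C11DB7u : reg << 1;
    }
    return reg;
}

int main(void)
{
    const uint8_t msg[] = "example packet";   /* 14 data bytes + NUL       */
    const size_t len = 14;
    uint8_t padded[16] = {0};                 /* 2 zero bytes prepended    */
    memcpy(padded + 2, msg, len);

    /* Leading zeros do not disturb a zero register, so both CRCs match. */
    assert(crc_raw(msg, len) == crc_raw(padded, len + 2));
    return 0;
}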

Why is the register length static in any CPU

Why is the register length (in bits) that a CPU operates on not dynamically/manually/arbitrarily adjustable? Would it make the computer slower if it were adjustable in this way?
Imagine you had an 8-bit integer. If you could adjust the CPU register length to 8 bits, the CPU would only have to go through the first 8 bits instead of extending the 8-bit integer to 64 bits and then going through all 64 bits.
At first I thought you were asking whether it was possible to have a CPU with no definitive register size. That makes no sense, since the number and size of the registers is a physical property of the hardware and cannot be changed.
However, some architectures let the programmer work on a smaller part of a register, or pair registers together.
The x86 does both for example, with add al, 9 (uses only 8 bits of the 64-bit rax) and div rbx (pairs rdx:rax to form a 128-bit register).
The reason this scheme is not more widespread is that it comes with a lot of trade-offs.
More registers means more bits needed to address them, simply put: longer instructions.
Longer instructions mean less code density, more complex decoders and less performance.
Furthermore most elementary operations, like the logic ones, addition and subtraction are already implemented as operating on a full register in a single cycle.
Finally, one execution unit can handle only one instruction at a time; we cannot issue eight 8-bit additions to a 64-bit ALU at the same time.
So there wouldn't be any improvement, neither in latency nor in throughput.
Accessing partial registers is useful for the programmer to stretch the number of available registers: for example, if an algorithm works with 16-bit data, the programmer could use a single physical 64-bit register to store four items and operate on them independently (but not in parallel).
ISAs that have variable-length instructions can also benefit from using partial registers, because that usually means smaller immediate values; for example, an instruction that sets a register to a specific value usually has an immediate operand that matches the size of the register being loaded (though RISC ISAs usually sign-extend or zero-extend it).
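As a purely illustrative C sketch of the "four 16-bit items in one 64-bit register" idea from the answer above (names are mine): the lanes are manipulated with shifts and masks, and each lane is updated independently, but still one operation at a time.

#include <stdint.h>

/* Four independent 16-bit items packed in one 64-bit "register". */
static uint16_t get_lane(uint64_t reg, int lane)          /* lane: 0..3 */
{
    return (uint16_t)(reg >> (lane * 16));
}

static uint64_t set_lane(uint64_t reg, int lane, uint16_t value)
{
    int shift = lane * 16;
    reg &= ~(0xFFFFull << shift);            /* clear the old lane      */
    return reg | ((uint64_t)value << shift); /* insert the new value    */
}

/* Example: increment lane 2 without touching the other three lanes.   */
uint64_t bump_lane2(uint64_t reg)
{
    return set_lane(reg, 2, (uint16_t)(get_lane(reg, 2) + 1));
}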
Architectures like ARM (and presumably others as well) support half-precision floats. The idea is to do what you were speculating and what @Margaret explained. With half-precision floats, you can pack two float values into a single register, thereby using less bandwidth at the cost of reduced accuracy.
Reference:
[1] ARM
[2] GCC

256 bit fixed point arithmetic, the future?

Just some silly musings, but if computers were able to efficiently calculate 256 bit arithmetic, say if they had a 256 bit architecture, I reckon we'd be able to do away with floating point. I also wonder, if there'd be any reason to progress past 256 bit architecture? My basis for this is rather flimsy, but I'm confident that you'll put me straight if I'm wrong ;) Here's my thinking:
You could have a 256-bit type that used 127 or 128 bits for the integer part, 127 or 128 bits for the fractional part, and of course a sign bit. If you had hardware capable of calculating, storing and moving such big numbers with no problems, I reckon you'd be set to handle any calculation you'd come across.
One example: if you were working with lengths and represented all values in meters, then the minimum value (2^-128 m) would be smaller than the Planck length, and the biggest value (2^127 m) would be bigger than the diameter of the observable universe. Imagine calculating light-years of distance with a precision smaller than a Planck length.
Ok, that's only one example, but I'm struggling to think of any situations that could possibly warrant bigger and smaller numbers than that. Any thoughts? Are there possible problems with fixed point arithmetic that I haven't considered? Are there issues with creating a 256 bit architecture?
SIMD will make narrow types valuable forever. If you can do a 256bit add, you can do eight 32bit integer adds in parallel on the same hardware (by not propagating carry across element boundaries). Or you can do thirty-two 8bit adds.
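The "not propagating carry across element boundaries" trick can even be simulated in plain C on a 64-bit value (a SWAR sketch of my own, not actual SIMD hardware): mask off the top bit of every byte so carries cannot cross lanes, then patch the top bits back in with XOR.

#include <stdint.h>

/* Eight independent 8-bit additions inside one 64-bit word (SWAR).
 * H marks the top bit of each byte; masking it out keeps carries from
 * crossing byte boundaries, and the XOR restores the top bit's sum.  */
uint64_t add_bytes(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8080808080808080ull;
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}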
Hardware multiplier circuits are a lot more expensive to make wider, so it's not safe to assume that a 256b x 256b multiplier will be practical to build.
Even besides SIMD considerations, memory bandwidth / cache footprint is a huge deal.
So 4B float will continue to be excellent for being precise enough to be useful, but small enough to pack many elements into a big vector, or in cache.
Floating-point also allows a much wider range of numbers by using some of its bits as an exponent. With mantissa = 1.0, the range of IEEE binary64 double goes from 2^-1022 to 2^1023 for "normal" numbers (53-bit mantissa precision over the whole range, only getting worse for denormals (gradual underflow)). Your proposal only handles numbers from about 2^-127 (with 1 bit of precision) to 2^127 (with 256b of precision).
Floating point has the same number of significant figures at any magnitude (until you get into denormals very close to zero), because the mantissa is fixed width. Normally this is a useful property, especially when multiplying or dividing. See Fixed Point Cholesky Algorithm Advantages for an example of why FP is good. (Subtracting two nearby numbers is a problem, though...)
Even though current SIMD instruction sets already have 256b vectors, the widest element width is 64b for add. AVX2's widest multiply is 32bit * 32bit => 64bit.
AVX512DQ has a 64b * 64b -> 64b (low half) vpmullq, which may show up in Skylake-E (Purley Xeon).
AVX512IFMA introduces a 52b * 52b + 64b => 64bit integer FMA. (VPMADD52LUQ low half and VPMADD52HUQ high half.) The 52 bits input precision is clearly so they can use the FP mantissa multiplier hardware, instead of requiring separate 64bit integer multipliers. (A full vector width of 64bit full-multipliers would be even more expensive than vpmullq. A compromise design like this even for 64bit integers should be a big hint that wide multipliers are expensive). Note that this isn't part of baseline AVX512F either, and may show up in Cannonlake, based on a Clang git commit.
Supporting arbitrary-precision adds/multiplies in SIMD (for crypto applications like RSA) is possible if the instruction set is designed for it (which Intel SSE/AVX isn't). Discussion on Agner Fog's recent proposal for a new ISA included an idea for SIMD add-with-carry.
For actually implementing 256b math on 32 or 64-bit hardware, see https://locklessinc.com/articles/256bit_arithmetic/ and https://gmplib.org/. It's really not that bad considering how rarely it's needed.
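As a flavor of what 256b math on 64-bit hardware looks like, here is a rough C sketch of a 256-bit add as four 64-bit limbs with manual carry propagation (real libraries such as GMP use carefully tuned assembly that exploits the carry flag). The serial dependency on the carry is exactly why very wide adds are latency-critical in hardware as well.

#include <stdint.h>

/* 256-bit unsigned integer as four 64-bit limbs, least-significant first. */
typedef struct { uint64_t limb[4]; } u256;

u256 add_u256(u256 a, u256 b)
{
    u256 r;
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = a.limb[i] + b.limb[i];    /* may wrap around          */
        uint64_t c = (s < a.limb[i]);          /* carry out of the raw add */
        r.limb[i] = s + carry;                 /* add carry-in from below  */
        carry = c | (r.limb[i] < s);           /* combined carry-out       */
    }
    return r;                                  /* final carry is discarded */
}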
Another big downside to building hardware with very wide integer registers is that even if the upper bits are usually unused, out-of-order execution hardware needs to be able to handle the case where it is used. This means a much larger physical register file compared to an architecture with 64-bit registers (which is bad, because it needs to be very fast and physically close to other parts of the CPU, and have many read ports). e.g. Intel Haswell has 168-entry PRFs for integer and FP/SIMD.
The FP register file already has 256b registers, so I guess if you were going to do something like this, you'd do it with execution units that used the SIMD vector registers as inputs/outputs, not by widening the integer registers. But the FP/SIMD execution units aren't normally connected to the integer carry flag, so you might need a separate SIMD-carry register for 256b add.
Intel or AMD already could have implemented an instruction / execution unit for adding 128b or 256b integers in xmm or ymm registers, but they haven't. (The max SIMD element width even for addition is 64-bit. Only shuffles operate on the whole register as a unit, and then only with byte-granularity or wider.)
128-bit computers: it is also about addressing memory, and about when we run out of 64 bits for addressing memory. Currently there are servers with 4 TB of memory, which requires about 42 address bits (2^42 > 4 x 10^12). If we assume that memory prices halve every second year, then we need one more bit every second year. We still have 22 bits left, so at least 2 * 22 = 44 years, and since memory prices are likely not dropping that fast, it will be more than 50 years before we run out of 64-bit addressing capability.

How are shifts implemented on the hardware level?

How are bit shifts implemented at the hardware level when the number to shift by is unknown?
I can't imagine that there would be a separate circuit for each number you can shift by (that would be 64 shift circuits on a 64-bit machine), nor can I imagine that it would be a loop of single-bit shifts (that would take up to 64 cycles on a 64-bit machine). Is it some sort of compromise between the two, or is there some clever trick?
The circuit is called a "barrel shifter" - it's basically a load of multiplexers. It has one layer per bit of the shift amount, so an 8-bit barrel shifter needs three bits to say "how much to shift by" and hence 3 layers of muxes.
Here's a picture of an 8-bit one from http://www.globalspec.com/reference/55806/203279/chapter-9-additional-circuit-designs:
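The same structure can be modeled in C (a sketch, not a transcription of the linked diagram): one conditional stage per bit of the shift amount, which corresponds to one layer of 2:1 multiplexers per stage in hardware.

#include <stdint.h>

/* 8-bit barrel shifter model: three stages, one per bit of the 3-bit
 * shift amount. Each `if` is one mux layer that either passes the value
 * through or shifts it by a fixed power of two. */
uint8_t barrel_shift_left(uint8_t x, unsigned amount)
{
    if (amount & 1) x = (uint8_t)(x << 1);   /* stage 0: shift by 1 */
    if (amount & 2) x = (uint8_t)(x << 2);   /* stage 1: shift by 2 */
    if (amount & 4) x = (uint8_t)(x << 4);   /* stage 2: shift by 4 */
    return x;
}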
