How can you deal with BOTH signed and unsigned numbers in VHDL? - vhdl

I'm writing a program that needs to work for signed AND unsigned numbers. You take a 32 bit input, first 24 bits is a whole number, last 8 bits is a fraction. Depending on what the fraction is you round up or down. Pretty simple, but how would you write a program that will work whether the input is signed OR unsigned? Do you just make two separate code blocks that execute depending on if a number is unsigned or not?

Your program would need to be aware of the source of the data, and from that information derive whether or not the number is signed. Otherwise, how is your program to know whether a vector of bits is signed or unsigned? Signedness is a convention humans use to structure data. The hardware you implement just sees a vector of bits.

A 32-bit unsigned number with 8 fraction bits can represent numbers in the range 0 to ((2^32)-1)/256.
A 32-bit signed number with 8 fraction bits can represent numbers in the range -(2^31)/256 to ((2^31)-1)/256.
So, how about converting your 32-bit input (signed or unsigned) to a 33-bit signed value? That can represent numbers in the range -(2^32)/256 to ((2^32)-1)/256, which covers your whole range of inputs.
(You have not given any code. In addition to your 32-bit input, there must be some other input to signal whether those 32 bits represent an unsigned or a signed number. You'll need to test that input and do the appropriate conversion based on its state.)
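The widen-then-round idea above can be sketched as follows. This is C rather than VHDL, purely to show the arithmetic. The function names, the `is_signed` flag, and the choice of rounding ties toward +infinity are all assumptions, and a 64-bit integer stands in for the 33-bit signed value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Widen a raw 32-bit word (24 integer bits, 8 fraction bits) to a wider
   signed value. is_signed is a hypothetical side-band flag that says how
   the 32 bits should be interpreted. */
int64_t widen_q24_8(uint32_t raw, bool is_signed)
{
    if (is_signed && raw >= 0x80000000u) {
        /* Two's complement: the top bit carries weight -2^32 after widening. */
        return (int64_t)raw - 4294967296LL;
    }
    return (int64_t)raw; /* zero-extend */
}

/* Round to the nearest whole number, ties toward +infinity (an assumption). */
int64_t round_q24_8(uint32_t raw, bool is_signed)
{
    int64_t biased = widen_q24_8(raw, is_signed) + 0x80; /* add one half */
    int64_t q = biased / 256;                            /* truncates toward zero */
    if (biased % 256 < 0) {
        q -= 1;                                          /* turn truncation into floor */
    }
    return q;
}
```

Once the value has been widened, a single rounding path serves both interpretations, so only the conversion step depends on the flag.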

Related

Can overflow occur when a positive number is subtracted from a positive number resulting in a negative number?

I am working with the mips assembly language but am confused on the overflow aspect of arithmetic here.
Say I am subtracting 25 from 20 and end up with -5. Would this result in an overflow?
I understand that with addition if you add 2 positive numbers or 2 negative numbers and the output is the opposite sign then there is overflow but am lost when it comes to subtraction.
Look at examples at the extremes. Let's do it in 8 bits signed to keep it simple; the same principles hold in 32 bits.
minuend: the smallest possible positive (non-negative) number, 0 and
subtrahend: the largest possible number, 127
When we do 0 - 127, the answer is -127, and that does fit in 8 bits signed.  There is a borrow.  In any processor the effect of the borrow is to propagate 1's throughout the upper bits, making it a negative number of the proper magnitude.
Different processors set flags differently based on this borrow, MIPS doesn't have flags, while x86 will set flags to indicate borrow, and other processors will set the flags to indicate carry.
In 8 bit signed numbers:
minuend: the smallest possible positive (non-negative) number, 0 and
subtrahend: the smallest possible (most negative) number, -128
When we do 0 - -128, the answer should be 128, but that cannot be represented in 8 bit signed format, so this is the example of overflow.  0 - -127 = 127 and that can be represented, so no overflow there.
If we do it in 8 bits unsigned, your example is 20 - 25 = -5, but -5 cannot be represented in an unsigned format, so this is indeed overflow (or modular arithmetic, if you like that).
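The 8-bit signed cases above can be checked with a small C sketch. The function names are made up for illustration. The first version does the subtraction in a wider type and tests the range. The second applies the sign rule for subtraction: overflow occurs only when the operands have different signs and the result's sign differs from the minuend's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Range check: do the subtraction exactly in int, then see if it fits. */
bool sub8_overflows(int8_t a, int8_t b)
{
    int diff = (int)a - (int)b;
    return diff < INT8_MIN || diff > INT8_MAX;
}

/* Sign rule on the raw bit patterns, as hardware would evaluate it. */
bool sub8_overflows_signrule(int8_t a, int8_t b)
{
    uint8_t ua = (uint8_t)a;
    uint8_t ub = (uint8_t)b;
    uint8_t ur = (uint8_t)(ua - ub); /* wrapped 8-bit result */
    /* Bit 7 set in both terms: signs of a and b differ, and the
       result's sign differs from a's. */
    return (((ua ^ ub) & (ua ^ ur)) & 0x80u) != 0;
}
```

Both versions report no overflow for 20 - 25 and for 0 - 127, and report overflow for 0 - (-128).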
Short answer: Yes, as the 32-bit representation of -5 is FFFFFFFB.
Long answer: Depends on what you mean by "overflow".
There's signed overflow, which is when you cross the 7FFFFFFF-80000000 boundary.
And there's unsigned overflow where you cross the FFFFFFFF-00000000 boundary.
For signed arithmetic, signed overflow is undeniably a bad thing (and is considered undefined behavior in C and other languages). However, unsigned overflow is not necessarily a problem. Usually it is, but many procedures rely on it to work.
For example, imagine you have a "frame timer" variable, i.e. a 32-bit counter variable that increments by 1 during an interrupt service routine. This interrupt is tied to a real-time clock running at 60 hertz, so every 1/60th of a second the variable's value increases by 1.
Now, this variable will overflow eventually. But do we really care? No. It just wraps around back to zero again. For our purposes, it's fine, since we really don't need to know that accurately how long our program has been running since it began. We probably have events that occur every n ticks of the timer, but we can just use a bitmask for that. Effectively in this case we're using unsigned overflow to say "if this value equals FFFFFFFF and we're about to add 1 to it, reset it to zero instead." Which thanks to overflow we can easily implement without any additional condition checking.
The reason I bring this up is so that you understand that overflow is not always a bad thing, if it's the unsigned variety. It depends entirely on what your data is intended to represent (which is something you can't explain even to a C compiler.)
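A minimal C sketch of the frame-timer idea, assuming a counter named `frame_counter` that an interrupt handler bumps (all names here are hypothetical). Unsigned arithmetic in C wraps modulo 2^32 by definition, so no extra check is needed at FFFFFFFF.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tick counter; volatile because an ISR would modify it. */
volatile uint32_t frame_counter = 0;

/* Would be called from the 60 Hz interrupt. Wraps from FFFFFFFF to 0. */
void on_tick(void)
{
    frame_counter++;
}

/* True once every n ticks, where n must be a power of two (bitmask test). */
bool every_n_ticks(uint32_t counter, uint32_t n_pow2)
{
    return (counter & (n_pow2 - 1u)) == 0;
}

/* Elapsed ticks between two samples; stays correct across one wraparound. */
uint32_t ticks_since(uint32_t now, uint32_t then)
{
    return now - then;
}
```

The subtraction in `ticks_since` is another place where unsigned wraparound is relied on: a sample taken just before the wrap and one taken just after still differ by the right number of ticks.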

Why do some arithmetic instructions have a signed/unsigned variant and some don't

Assume we have:
a = 0b11111001;
b = 0b11110011;
If we do addition and multiplication on paper by hand we get this result, and we don't care whether it's signed or not:
a + b = 111101100
a * b = 1110110001011011
I know that Multiplication doubles the width and addition could overflow:
Why is imul used for multiplying unsigned numbers?
Why do some CPUs have different instructions to do signed and unsigned operations?
My question is, why instructions like Add don't usually have a signed/unsigned version, but Multiply and Divide do?
Why can't we have a generic unsigned multiply, do the math like I did above, and truncate the result if it's signed, the same way Add does?
Or the other, why can't Add have a signed/unsigned version. I have checked a few architectures and this seems to be the case.
I think your choice of example misled you into thinking the signed product could be obtained by truncating the 8x8 => 16-bit unsigned product down to 8 bits. That is not the case.
(249-256) * (243-256) = 0x005b, a small positive result that happens to fit in the low half of the full result. But the full signed result is not always the operand-size truncation of the unsigned product.
For example, -128 * 127 is -16256, or as 16-bit 2's complement, 0xc080.
But 0x80 * 0x7f is + 16256, i.e. 0x3f80. Same low half, different upper half.
Or for another example, see Why are signed and unsigned multiplication different instructions on x86(-64)?
Widening signed multiply doesn't involve any truncation. The low half of a signed and an unsigned multiply is the same, which is why x86, for example, only has immediate and 2-operand forms of imul, and not also of mul. Only widening multiply needs a separate form. (A separate form would also be needed if you wanted FLAGS set according to unsigned overflow of the low half instead of signed overflow, so you can't easily use non-widening imul to detect when the full unsigned result didn't fit.)
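The examples above can be reproduced with a short C sketch of 8x8 => 16-bit widening multiplies. The function names are invented for illustration, and the signed version reinterprets the 8-bit pattern by hand to avoid relying on implementation-defined conversions.

```c
#include <stdint.h>

/* Widening unsigned multiply: treat both bytes as 0..255. */
uint16_t mul8_unsigned(uint8_t a, uint8_t b)
{
    return (uint16_t)((unsigned)a * (unsigned)b);
}

/* Widening signed multiply: treat both bytes as -128..127 and return
   the 16-bit two's complement pattern of the product. */
uint16_t mul8_signed(uint8_t a, uint8_t b)
{
    int sa = (a >= 0x80) ? (int)a - 256 : (int)a;
    int sb = (b >= 0x80) ? (int)b - 256 : (int)b;
    return (uint16_t)(sa * sb); /* conversion to unsigned is modulo 2^16 */
}
```

For 0x80 and 0x7f this gives 0x3f80 unsigned and 0xc080 signed: the low bytes agree and the high bytes do not, which is why only the widening form needs two instructions.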

What is the minimum number of bits I need to express a n-bit, signed std_logic_vector in VHDL?

I'm new to VHDL and am trying to find a way to take a n bit (stored as a generic) signed number and truncate it to a form that requires the minimum number of bits.
For example, if I have 5 as its 8 bit signed number (stored in a std_logic_vector of length 8) 00000101, I'd like to make a function to return 0101 as a std_logic_vector. Any ideas on how I can accomplish this?
Since you have specified that you're using a signed value, you may want to use the signed type (from the numeric_std library) instead of the more generic std_logic_vector.
If your number is a compile time constant, you can write a function starting from the leftmost bit (in a for loop for example) that counts how many identical bits it sees, then returns signed_input(8-result downto 0). The issue with this is that as a compile time constant, there isn't much advantage in removing the redundant bits. The whole vector will be optimized away in synthesis.
You might want to include special cases to make the result at least 1 bit (0 technically doesn't need any bits to represent) or 2 bits (-1 only needs the sign bit to distinguish it from 0), depending on how you want to use your signed type value.
If your number is a real signal (the value changes during operation), you can still count the number of identical bits from the left, but variable location slicing of the vector will be iffy. Are you trying to pack the most of several numbers into a fixed bit width? Doing that will synthesize into multiplexers for each bit as well as the LUTs used for calculating the number of redundant bits for each of the numbers.
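The counting step described above (scan from the left and drop bits that merely repeat the sign bit) can be sketched in C to show the logic; a VHDL function would follow the same loop. The function name is hypothetical, `width` is assumed to be between 1 and 32, and the result is clamped to a minimum of 1 bit, which is one of the special-case choices mentioned above.

```c
#include <stdint.h>

/* Minimum number of bits needed to hold the two's complement value stored
   in the low `width` bits of `pattern`, sign bit included.
   Assumes 1 <= width <= 32. */
unsigned min_signed_bits(uint32_t pattern, unsigned width)
{
    unsigned bits = width;
    /* The top bit is redundant while it equals the bit just below it. */
    while (bits > 1 &&
           ((pattern >> (bits - 1)) & 1u) == ((pattern >> (bits - 2)) & 1u)) {
        bits--;
    }
    return bits;
}
```

For the 8-bit pattern 00000101 this returns 4, matching the 0101 asked for in the question.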

Arithmetic Operations using only 32 bit integers

How would you compute the multiplication of two 1024 bit numbers on a microprocessor that is only capable of multiplying 32 bit numbers?
The starting point is to realize that you already know how to do this: in elementary school you were taught how to do arithmetic on single digit numbers, and then given data structures to represent larger numbers (e.g. decimals) and algorithms to compute arithmetic operations (e.g. long division).
If you have a way to multiply two 32-bit numbers to give a 64-bit result (note that unsigned long long is guaranteed to be at least 64 bits), then you can use those same algorithms to do arithmetic in base 2^32.
You'll also need, e.g., an add with carry operation. You can determine the carry when adding two unsigned numbers of the same type by detecting overflow, e.g. as follows:
uint32_t x, y; // set to some value
uint32_t sum = x + y;
uint32_t carry = (sum < x);
(technically, this sort of operation requires that you do unsigned arithmetic: overflow in signed arithmetic is undefined behavior, and optimizers will do surprising things to your code when you least expect it)
(modern processors usually give a way to multiply two 64-bit numbers to give a 128-bit result, but to access it you will have to use compiler extensions like 128-bit types, or you'll have to write inline assembly code. modern processors also have specialized add-with-carry instructions)
Now, to do arithmetic efficiently is an immense project; I found it quite instructive to browse through the documentation and source code to gmp, the GNU multiple precision arithmetic library.
Look at any implementation of bigint operations.
Here are a few of my approaches in C++ for a fast bignum square;
some are solely for squaring, but others are usable for multiplication:
use 32-bit arithmetic as a building block for 64/128/256/... bit arithmetic
see my 32-bit ALU in x86 C++
use long multiplication with digit base 2^32
Karatsuba can also be used this way
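Long multiplication in base 2^32, as suggested above, might look like the following C sketch. The function name and the little-endian limb layout are assumptions. A 1024-bit number would occupy 32 limbs, and the product 64.

```c
#include <stddef.h>
#include <stdint.h>

/* Schoolbook multiplication of two little-endian arrays of 32-bit limbs.
   `out` must have room for na + nb limbs and must not alias a or b. */
void bigmul(const uint32_t *a, size_t na,
            const uint32_t *b, size_t nb,
            uint32_t *out)
{
    for (size_t i = 0; i < na + nb; i++) {
        out[i] = 0;
    }
    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* 32x32 => 64-bit product plus two 32-bit terms cannot
               exceed 2^64 - 1, so t never overflows. */
            uint64_t t = (uint64_t)a[i] * b[j] + out[i + j] + carry;
            out[i + j] = (uint32_t)t; /* low 32 bits stay in place */
            carry = t >> 32;          /* high 32 bits move to the next limb */
        }
        out[i + nb] = (uint32_t)carry;
    }
}
```

This is the quadratic-time method. Karatsuba and the techniques used in GMP reduce the work for large operands at the cost of considerably more code.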

splitting a 64-bit addition into two 32-bit additions

I'm working on an ultra-performance-intensive computational task. For pairwise addition of two 32-bit integer arrays, could one, on a 64-bit architecture, treat two 32-bit values as a single 64-bit value, add it to its counterpart in the other array, then split the result up again with a bitwise & operator? Obviously if there is an overflow the results will not be the same, but assuming there is none, will there be a problem? (And can you extend this to 16- and 8-bit additions?)
Does the behavior change for unsigned vs signed?
There's no difference between signed and unsigned: on a two's complement machine it's just one instruction that doesn't know about the sign. Yes, you can safely do this trick if there's no overflow risk, and you can do it for subparts of any length. For example, you can treat your 64-bit number as holding two 13-bit numbers and one 38-bit number.
If you assume no overflow, you can do this down to single bits. Of course, 1+1 overflows.
But in practice, you either have overflow, or you really had 31-bit integers to start with.
One other thing: it only works on unsigned types. You can't have a sign bit in the middle of a 64 bit number.
But why do you care? If you're going "ultra-performance-intensive", use SSE. It will do parallel addition properly.
Yes, you can do this, but it would only work for unsigned values. With signed 32bit integers, the sign bit is the high order bit, which causes overflow when adding.
You probably don't need to do this - if your native C compiler isn't giving the performance you need, then look at using the vector operations (MMX, SSE etc) that do this sort of vector operations extremely efficiently.
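To make the carry problem concrete, here is a C sketch with two 32-bit lanes packed into one 64-bit word. The helper names are hypothetical. The naive version is a plain 64-bit add, so a carry out of the low lane leaks into the high lane. The second version uses a common masking trick to block that carry, which gives each lane wrapping 32-bit addition.

```c
#include <stdint.h>

/* Pack two 32-bit lanes into one 64-bit word. */
uint64_t pack2(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Plain 64-bit add: only correct if the low lane produces no carry. */
uint64_t add2_naive(uint64_t x, uint64_t y)
{
    return x + y;
}

/* Lane-wise wrapping add: no carry crosses from the low to the high lane. */
uint64_t add2_lanes(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8000000080000000ULL;  /* top bit of each lane */
    uint64_t sum_low = (x & ~H) + (y & ~H);    /* add the low 31 bits per lane */
    return sum_low ^ ((x ^ y) & H);            /* fold the top bits back in */
}
```

Note that a low lane holding -1 (FFFFFFFF) plus 1 carries out even though the signed sum does not overflow, so the naive form is only safe when every lane's unsigned sum fits in 32 bits. The masked form costs a few extra operations, which is part of why SIMD instructions are usually the better tool.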
