Why do some arithmetic instructions have a signed/unsigned variant and some don't

Why do some arithmetic instructions have a signed/unsigned variant and some don't - cpu

Assume we have:
a = 0b11111001;
b = 0b11110011;
If we do Addition and Multiplication on paper with hand we get this result, we don't care if its signed or not:
a + b = 111101100
a * b = 1110110001011011
I know that Multiplication doubles the width and addition could overflow:
Why is imul used for multiplying unsigned numbers?
Why do some CPUs have different instructions to do signed and unsigned operations?
My question is, why instructions like Add don't usually have a signed/unsigned version, but Multiply and Divide do?
Why can't we have a generic unsigned multiply, do the math like I did above and truncate the result if its singed, same way Add does.
Or the other, why can't Add have a signed/unsigned version. I have checked a few architectures and this seems to be the case.

I think your choice of example misled you into thinking the signed product could be obtained by truncating the 8x8 => 16-bit unsigned product down to 8 bits. That is not the case.
(249-256) * (243-256) = 0x005b, a small positive result that happens to fit in the low half of the full result. But the full signed result is not always the operand-size truncation of the unsigned product.
For example, -128 * 127 is -16256, or as 16-bit 2's complement, 0xc080.
But 0x80 * 0x7f is + 16256, i.e. 0x3f80. Same low half, different upper half.
Or for another example, see Why are signed and unsigned multiplication different instructions on x86(-64)?
Widening signed-multiply doesn't involve any truncation. The low half of signed and unsigned multiply is the same, that's why x86 for example only has immediate and 2-operand forms of imul, not also mul. Only widening multiply needs a separate form. (Or if you want FLAGS set according to unsigned overflow of the low half, instead of signed overflow. - so you can't easily use non-widening imul if you want to detect when the full unsigned result didn't fit.)

Related

Can overflow occur when a positive number is subtracted from a positive number resulting in a negative number?

I am working with the mips assembly language but am confused on the overflow aspect of arithmetic here.
Say I am subtracting 25 from 20 and end up with -5. Would this result in an overflow?
I understand that with addition if you add 2 positive numbers or 2 negative numbers and the output is the opposite sign then there is overflow but am lost when it comes to subtraction.

Find the examples at the extreme, let's do it in 8 bits signed to make it simple but the same principles holds in 32 bits
minuend: the smallest possible positive (non-negative) number, 0 and
subtrahend: the largest possible number, 127
When we do 0 - 127, the answer is -127 and that indeed fits in 8 bits signed.  There is a borrow.  In any processor the effect of the borrow is to propagate 1's though out the upper bits, making it a negative number of the proper magnitude.
Different processors set flags differently based on this borrow, MIPS doesn't have flags, while x86 will set flags to indicate borrow, and other processors will set the flags to indicate carry.
In 8 bit signed numbers:
minuend: the smallest possible positive (non-negative) number, 0 and
subtrahend: the largest possible number, -128
When we do 0 - -128, the answer should be 128, but that cannot be represented in 8 bit signed format, so this is the example of overflow.  0 - -127 = 127 and that can be represented, so no overflow there.
If we do it in 8 bits unsigned, your example of 25-20 = -5, but -5 cannot be represented in an unsigned format, so this is indeed overflow (or modular arithmetic, if you like that).

Short answer: Yes, as the 32-bit representation of -5 is FFFFFFFB.
Long answer: Depends on what you mean by "overflow".
There's signed overflow, which is when you cross the 7FFFFFFF-80000000 boundary.
And there's unsigned overflow where you cross the FFFFFFFF-00000000 boundary.
For signed arithmetic, signed overflow is undeniably a bad thing (and is considered undefined behavior in C and other languages). However, unsigned overflow is not necessarily a problem. Usually it is, but many procedures rely on it to work.
For example, imagine you have a "frame timer" variable, i.e. a 32-bit counter variable that increments by 1 during an interrupt service routine. This interrupt is tied to a real-time clock running at 60 hertz, so every 1/60th of a second the variable's value increases by 1.
Now, this variable will overflow eventually. But do we really care? No. It just wraps around back to zero again. For our purposes, it's fine, since we really don't need to know that accurately how long our program has been running since it began. We probably have events that occur every n ticks of the timer, but we can just use a bitmask for that. Effectively in this case we're using unsigned overflow to say "if this value equals FFFFFFFF and we're about to add 1 to it, reset it to zero instead." Which thanks to overflow we can easily implement without any additional condition checking.
The reason I bring this up is so that you understand that overflow is not always a bad thing, if it's the unsigned variety. It depends entirely on what your data is intended to represent (which is something you can't explain even to a C compiler.)

Implementation of strtof(), floating-point multiplication and mantissa rounding issues

This question is not so much about the C as about the algorithm. I need to implement strtof() function, which would behave exactly the same as GCC one - and do it from scratch (no GNU MPL etc.).
Let's skip checks, consider only correct inputs and positive numbers, e.g. 345.6e7. My basic algorithm is:
Split the number into fraction and integer exponent, so for 345.6e7 fraction is 3.456e2 and exponent is 7.
Create a floating-point exponent. To do this, I use these tables:
static const float powersOf10[] = {
1.0e1f,
1.0e2f,
1.0e4f,
1.0e8f,
1.0e16f,
1.0e32f
};
static const float minuspowersOf10[] = {
1.0e-1f,
1.0e-2f,
1.0e-4f,
1.0e-8f,
1.0e-16f,
1.0e-32f
};
and get float exponent as a product of corresponding bits in integer exponent, e.g. 7 = 1+2+4 => float_exponent = 1.0e1f * 1.0e2f * 1.0e4f.
Multiply fraction by floating exponent and return the result.
And here comes the first problem: since we do a lot of multiplications, we get a somewhat big error becaule of rounding multiplication result each time. So, I decided to dive into floating point multiplication algorithm and implement it myself: a function takes a number of floats (in my case - up to 7) and multiplies them on bit level. Consider I have uint256_t type to fit mantissas product.
Now, the second problem: round mantissas product to 23 bits. I've tried several rounding methods (round-to-even, Von Neumann rounding - a small article about them), but no of them can give the correct result for all the test numbers. And some of them really confuse me, like this one:
7038531e-32. GCC's strtof() returns 0x15ae43fd, so correct unbiased mantissa is 2e43fd. I go for multiplication of 7.038531e6 (biased mantissa d6cc86) and 1e-32 (b.m. cfb11f). The resulting unbiased mantissa in binary form is
( 47)0001 ( 43)0111 ( 39)0010 ( 35)0001
( 31)1111 ( 27)1110 ( 23)1110 ( 19)0010
( 15)1011 ( 11)0101 ( 7)0001 ( 3)1101
which I have to round to 23 bits. However, by all rounding methods I have to round it up, and I'll get 2e43fe in result - wrong! So, for this number the only way to get correct mantissa is just to chop it - but chopping does not work for other numbers.
Having this worked on countless nights, my questions are:
Is this approach to strtof() correct? (I know that GCC uses GNU MPL for it, and tried to see into it. However, trying to copy MPL's implementation would require porting the entire library, and this is definitely not what I want). Maybe this split-then-multiply algorithm is inevitably prone to errors? I did some other small tricks, (e.g. create exponent tables for all integer exponents in float range), but they led to even more failed conversions.
If so, did I miss something while rounding? I thought so for long time, but this 7038531e-32 number completely confused me.

If I want to be as precise as I can I usually do stuff like this (however I usually do the reverse operation float -> text):
use only integers (no floats what so ever)
as you know float is integer mantissa bit-shifted by integer exponent so no need for floats.
For constructing the final float datatype you can use simple union with float and 32 bit unsigned integer in it ... or pointers to such types pointing to the same address.
This will avoid rounding errors for numbers that fit completely and shrink error for those that don't fit considerably.
use hex numbers
You can convert your text of decadic number on the run into its hex counterpart (still as text) from there creating mantissa and exponent integers is simple.
Here:
How to convert a gi-normous integer (in string format) to hex format? (C#)
is C++ implementation example of dec2hex and hex2dec number conversions done on text
use more bits for mantissa while converting
for task like this and single precision float I usually use 2 or 3 32 bit DWORDs for the 24 bit mantissa to still hold some precision after the multiplications If you want to be precise you have to deal with 128+24 bits for both integer and fractional part of number so 5x32 bit numbers in sequence.
For more info and inspiration see (reverse operation):
my best attempt to print 32 bit floats with least rounding errors (integer math only)
Your code will be just inverse of that (so many parts will be similar)
Since I post that I made even more advanced version that recognize formatting just like printf , supports much more datatypes and more without using any libs (however its ~22.5 KByte of code). I needed it for MCUs as GCC implementation of prints are not very good there ...

instrinsic _mm512_round_ps is missing for AVX512

I'm missing the intrinsic _mm512_round_ps for AVX512 (it is only available for KNC). Any idea why this is not available?
What would be a good workaround?
apply _mm256_round_ps to upper and lower half and fuse the results?
use _mm512_add_round_ps with one argument being zero?
Thanks!

TL:DR: AVX512F
__m512 nearest_integer = _mm512_roundscale_ps(input_vec, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC);
related: AVX512DQ _mm512_reduce_pd or _ps will subtract the integer part (and a specified number of leading fraction bits), range-reducing your input to only the fractional part. asm docs for vreducepd have the most detail.
The EVEX prefix allows overriding the default rounding direction {er} and setting suppress-all-exceptions {sae}, for FP instructions. (This is what the ..._round_ps() versions of intrinsics are for.) But it doesn't have a "round to integer" option; you still need a separate asm instruction for that.
vroundps xy, xy/mem, imm8 didn't get upgraded to AVX512. Actually it did: the same opcode has a new mnemonic for the EVEX version, using the high 4 bits of the immediate that are reserved in the SSE and VEX encodings.
vrndscaleps xyz, xyz/mem/m32broadcast, imm8 is available in ss/sd/ps/pd flavours. The high 4 bits of the imm8 specify the number of fraction bits to round to. In these terms, rounding to the nearest integer is rounding to 0 fraction bits. Rounding to nearest 0.5 would be rounding to 1 fraction bit. It's the same as scaling by 2^M, rounding to nearest integer, then scaling back down (done without overflow).
I think the field is unsigned, so you can't use M=-1 to round to an even number. The ISA ref manual doesn't mention signedness, so I'm leaning towards unsigned being the most likely.
The low 4 bits of the field specify the rounding mode like with roundps. As usual, the PD version of the instruction has the diagram (because it's alphabetically first).
With the upper 4 bits = 0, it behaves the same as roundps: they use the same encoding for the low 4 bits. It's not a coincidence that the instructions have the same opcode, just different prefixes.
(I'm curious if SSE or VEX roundpd on an AVX512 CPU would actually scale based on the upper 4 bits; it says they're "reserved" not "ignored". But probably not.)
__m512 _mm512_roundscale_ps( __m512 a, int imm); is the no-frills intrinsic. See Intel's intrinsic finder
The merge-masking + SAE-override version is __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);. There's nothing you can do with the sae operand that roundscale can't already do with its imm8, though, so it's a bit pointless.
You can use the _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC and so on constants documented for _mm_round_pd / _mm256_round_pd, to round up, down, or truncate towards zero, or the usual nearest with even-as-tiebreak that's the IEEE default rounding mode. Or _MM_FROUND_CUR_DIRECTION to use whatever the current mode is. _MM_FROUND_NO_EXC suppresses setting the inexact exception bit in the MXCSR.
You might be wondering why vrndscaleps needs any immediate bits to specify rounding direction when you could just use the EVEX prefix to override the rounding direction with vrndscaleps zmm0 {k1}, zmm1, {rz-sae} (Or whatever the right syntax is; NASM doesn't seem to be accepting any of the examples I found.)
The answer is that explicit rounding is only available with 512-bit vectors or with scalars, and only for register operands. (It repurposes 3 EVEX bits used to set vector length (if AVX512VL is supported), and to distinguish between broadcast memory operands vs. vector. EVEX bits are overloaded based on context to pack more functionality into limited space.)
So having the rounding-control in the imm8 makes it possible to do vrndscaleps zmm0{k1}, [rdi]{m32bcst}, imm8 to broadcast a float from memory, round it, and merge that into an existing register according to mask register k1. All in a single instruction which decodes to probably 3 uops on SKX, assuming it's the same as vroundps. (http://agner.org/optimize/).

Arithmetic Operations using only 32 bit integers

How would you compute the multiplication of two 1024 bit numbers on a microprocessor that is only capable of multiplying 32 bit numbers?

The starting point is to realize that you already know how to do this: in elementary school you were taught how to do arithmetic on single digit numbers, and then given data structures to represent larger numbers (e.g. decimals) and algorithms to compute arithmetic operations (e.g. long division).
If you have a way to multiply two 32-bit numbers to give a 64-bit result (note that unsigned long long is guaranteed to be at least 64 bits), then you can use those same algorithms to do arithmetic in base 2^32.
You'll also need, e.g., an add with carry operation. You can determine the carry when adding two unsigned numbers of the same type by detecting overflow, e.g. as follows:
uint32_t x, y; // set to some value
uint32_t sum = x + y;
uint32_t carry = (sum < x);
(technically, this sort of operation requires that you do unsigned arithmetic: overflow in signed arithmetic is undefined behavior, and optimizers will do surprising things to your code you least expect it)
(modern processors usually give a way to multiply two 64-bit numbers to give a 128-bit result, but to access it you will have to use compiler extensions like 128-bit types, or you'll have to write inline assembly code. modern processors also have specialized add-with-carry instructions)
Now, to do arithmetic efficiently is an immense project; I found it quite instructive to browse through the documentation and source code to gmp, the GNU multiple precision arithmetic library.

look at any implementation of bigint operations
here are few of mine approaches in C++ for fast bignum square
some are solely for sqr but others are usable for multiplication...
use 32bit arithmetics as a module for 64/128/256/... bit arithmetics
see mine 32bit ALU in x86 C++
use long multiplication with digit base 2^32
can use also Karatsuba this way

How can you deal with BOTH signed and unsigned numbers in VHDL?

I'm writing a program that needs to work for signed AND unsigned numbers. You take a 32 bit input, first 24 bits is a whole number, last 8 bits is a fraction. Depending on what the fraction is you round up or down. Pretty simple, but how would you write a program that will work whether the input is signed OR unsigned? Do you just make two separate code blocks that execute depending on if a number is unsigned or not?

Your program would need to be aware of the source if the data, and from that information derive whether or not the number is signed. Otherwise, how is your program to know whether a vector of bits is (un)signed? Signage is a convention for humans to use to structure data. The hardware you implement just sees a vector of bits.

A 32-bit unsigned number with 8 fraction bits can represent numbers in the range 0 to ((2^32)-1)/256.
A 32-bit signed number with 8 fraction bits can represent numbers in the range -(2^31)/256 to ((2^31)-1)/256.
So, how about converting your 32-bit input (signed or unsigned) to 33-bit signed, which will be able to represent numbers in the range -(2^32)/256 to ((2^32)-1)/256, which will cover your whole range of inputs.
(You have not given any code. In addition to your 32-bit input, there must be some other input to signal whether those 32 bits represent an unsigned or a signed number. You'll need to test that input and do the appropriate conversion based on its state.)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio