Gathering half-float values using AVX intrinsics

Using AVX/AVX2 intrinsics, I can gather sets of 8 values, either 1-, 2- or 4-byte integers, or 4-byte floats, using:
_mm256_i32gather_epi32()
_mm256_i32gather_ps()
But currently, I have a case where I am loading data that was generated on an nvidia GPU and stored as FP16 values. How can I do vectorized loads of these values?
So far, I found the _mm256_cvtph_ps() intrinsic.
However, input for that intrinsic is a __m128i value, not a __m256i value.
Looking at the Intel Intrinsics Guide, I see no gather operation that stores 8 such values into a __m128i register.
How can I gather FP16 values into the 8 lanes of a __m256 register? Is it possible to vector load them as 2-byte shorts into __m256i and then somehow reduce that to a __m128i value to be passed into the conversion intrinsic? If so, I haven't found intrinsics to do that.
UPDATE
I tried the cast as suggested by @Peter Cordes, but I am getting bogus results from that. Also, I don't understand how that could work.
My 2-byte int values are stored in __m256i as:
0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX
so how can I simply cast to __m128i where it needs to be tightly packed as
XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
Will the cast do that?
My current code:
__fp16* fielddensity = ...
__m256i indices = ...
__m256i msk = _mm256_set1_epi32(0xffff);
__m256i d = _mm256_and_si256(_mm256_i32gather_epi32(fielddensity,indices,2), msk);
__m256 v = _mm256_cvtph_ps(_mm256_castsi256_si128(d));
But the result doesn't seem to be 8 properly formed values. I think every 2nd one is currently bogus for me?

There is indeed no gather instruction for 16-bit values, so you need to gather 32-bit values and ignore one half of them (and make sure that you don't accidentally read from invalid memory). Also, _mm256_cvtph_ps() needs all input values in the lower 128-bit lane, and unfortunately there is no lane-crossing 16-bit shuffle (until AVX512).
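If you do want to use _mm256_cvtph_ps(), you can still compress the eight gathered words into the low 128 bits with an in-lane pack followed by a 64-bit permute. An untested sketch, assuming AVX2 plus F16C and assuming the gather may safely read 2 bytes past each element (it loads 4 bytes per index); the function name is mine:
#include <immintrin.h>
#include <stdint.h>

__m256 gather_fp16_cvt(uint16_t const* base, __m256i indices) {
    // Gather 4 bytes per index; each fp16 value lands in the low 16 bits.
    __m256i d = _mm256_i32gather_epi32((int const*)base, indices, 2);
    d = _mm256_and_si256(d, _mm256_set1_epi32(0xFFFF)); // drop the garbage halves
    // Pack 32->16 within each 128-bit lane; the words become
    // [v0 v1 v2 v3 0 0 0 0 | v4 v5 v6 v7 0 0 0 0].
    __m256i w = _mm256_packus_epi32(d, _mm256_setzero_si256());
    // Move the two useful 64-bit chunks together into the low 128 bits.
    w = _mm256_permute4x64_epi64(w, _MM_SHUFFLE(3, 1, 2, 0));
    return _mm256_cvtph_ps(_mm256_castsi256_si128(w));
}
This costs one lane-crossing shuffle (vpermq) on top of the pack, which is why the bit-twiddling route below can still be attractive.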
However, assuming you have only finite input values, you could do some bit-twiddling (avoiding the _mm256_cvtph_ps()). If you load a half-precision value into the upper half of a 32-bit register, you can do the following operations:
SEEEEEMM MMMMMMMM XXXXXXXX XXXXXXXX // input Sign, Exponent, Mantissa, X=garbage
Shift arithmetically to the right by 3 (this keeps the sign bit where it needs to be):
SSSSEEEE EMMMMMMM MMMXXXXX XXXXXXXX
Mask away the excess sign bits and the garbage at the bottom (with 0b1000'11111'1111111111'0000000000000):
S000EEEE EMMMMMMM MMM00000 00000000
This will be a valid single-precision float, but the exponent will be off by 112 = 127 - 15 (the difference between the biases), i.e. you need to multiply these values by 2**112 (this may be combined with any subsequent operation you intend to do anyway). Note that this will also convert sub-normal float16 values to the corresponding sub-normal float32 values (which are likewise off by a factor of 2**112).
Untested intrinsic version:
__m256 gather_fp16(__fp16 const* fielddensity, __m256i indices){
    // Subtract 2 bytes from the base address to load each value into the
    // high half of a 32-bit element:
    int32_t const* base = (int32_t const*) (fielddensity - 1);
    // Gather 32-bit values.
    // Be aware that this reads two bytes before each desired value,
    // i.e., make sure that reading fielddensity[-1] is ok!
    __m256i d = _mm256_i32gather_epi32(base, indices, 2);
    // Shift the exponent bits to the right place and mask away the excess bits:
    d = _mm256_and_si256(_mm256_srai_epi32(d, 3), _mm256_set1_epi32(0x8fffe000));
    // Scale to compensate the bias difference (could be combined with subsequent operations):
    __m256 two112 = _mm256_castsi256_ps(_mm256_set1_epi32(0x77800000)); // 2**112
    __m256 f = _mm256_mul_ps(_mm256_castsi256_ps(d), two112);
    return f;
}

Related

Can I set TIMx ARR value to initialize at chosen value?

So I am trying to use rotary encoders to control a menu on my STM32 project. I am using two rotary encoders, one for each side of the screen (split menu).
When I initialize the ARR registers of both timers responsible for counting the encoder pulses, the counters start at 0, and when I move the encoders counterclockwise the registers underflow, jump to the maximum value of 65535, and mess with how my code calculates detents.
Can you guys tell me if there is any way to set the TIM->CNT value to a custom value somewhere in the middle between 0 and 65535?
This way I could easily check differences between values and not worry about the jump in numbers.
when I move the encoders counterclockwise the registers underflow, jump to the maximum value of 65535, and mess with how my code calculates detents.
The counter is a 16 bit value put into a 32 bit unsigned register. To get a proper signed value, cast it to int16_t.
int wheelposition = (int16_t)TIMx->CNT;
A value of 65535 (0xFFFF) would be sign-extended to 0xFFFFFFFF, which is interpreted as -1 in a 32 bit integer variable. But then you'd have the problem that it'd overflow from -32768 to +32767.
If you are interested in the signed difference of two position readings, you can do the subtraction on the unsigned values, and cast the result to int16_t.
uint32_t oldposition, newposition;
int wheelmovement;
oldposition = TIMx->CNT;
/* wait a bit */
newposition = TIMx->CNT;
wheelmovement = (int16_t)(newposition - oldposition);
It will give you the signed difference, with 16 bit overflow taken into account.
is there any way to set the TIM->CNT value to a custom value somewhere in the middle between 0 and 65535?
You can simply assign any value to TIMx->CNT; it will continue counting from there.
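For example (a hedged sketch; TIM2 and TIM3 stand in for whichever timers drive your encoders), initializing both counters mid-range keeps normal movement away from the wrap-around:
/* after the encoder-mode timers have been started */
TIM2->CNT = 32768; /* mid-range start for the left encoder  */
TIM3->CNT = 32768; /* mid-range start for the right encoder */
That said, the signed-difference trick above works even without this, because the subtraction wraps correctly modulo 2^16.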

Strange pointer arithmetic

I came across very strange pointer arithmetic behaviour. I am developing a program to read an SD card from an LPC2148 using the ARM GNU toolchain (on Linux). A sector of my SD card contains data (in hex) like this (checked with the Linux xxd command):
fe 2a 01 34 21 45 aa 35 90 75 52 78
When printing individual bytes, it prints perfectly.
char *ch = buffer; /* char buffer[512]; */
for (i = 0; i < 12; i++)
    debug("%x ", *ch++);
Here, the debug function sends output over the UART.
However, pointer arithmetic, especially adding an offset which is not a multiple of 4, gives very strange results.
uint32_t *p; // uint32_t is typedef to unsigned long.
p = (uint32_t*)((char*)buffer + 0);
debug("%x ", *p); // prints 34012afe // correct
p = (uint32_t*)((char*)buffer + 4);
debug("%x ", *p); // prints 35aa4521 // correct
p = (uint32_t*)((char*)buffer + 2);
debug("%x ", *p); // prints 0134fe2a // TOO STRANGE??
Am I choosing a wrong compiler option? Please help.
I tried the optimization options -O0 and -Os, but there was no change.
I thought of little/big endian issues, but here I am getting unexpected data (from previous bytes) and no order reversing.
For this to work, your CPU architecture would have to support unaligned load and store operations.
To the best of my knowledge, it doesn't (and I've been using the STM32, which is an ARM-based Cortex).
If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get the wrong output.
I'm not sure what the address of your buffer is, but at least one of the three uint32_t reads that you describe in your question requires the processor to perform an unaligned load operation.
On STM32, you would get a memory-access violation (resulting in a hard-fault exception).
The data-sheet should provide a description of your processor's expected behavior.
UPDATE:
Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).
So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:
uint8_t  x = *(uint8_t*) y; // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8
In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:
struct
{
    uint16_t a; // offset 0, divisible by sizeof(uint16_t), which is 2
    uint8_t  b; // offset 2, divisible by sizeof(uint8_t), which is 1
    uint8_t  c; // offset 3, divisible by sizeof(uint8_t), which is 1
    uint32_t d; // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t e; // offset 8, divisible by sizeof(uint64_t), which is 8
};
Please note, that this does not guarantee that your data-structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using, is pointing to a memory address divisible by the size of the largest field (in the example above, 8).
SUMMARY:
There are two basic rules that you need to follow:
Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.
Exceptions:
Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient (requiring the compiler to add NOPs "in between"). Ideally, one should strive to follow rule #1 even if the compiler does support unaligned operations, and let the compiler know that the data is well aligned (using a dedicated #pragma), in order to allow the compiler to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).
LDR is the ARM instruction to load data. You have lied to the compiler, telling it that the pointer points to a properly aligned 32-bit value. It is not aligned, and you pay the price. Here is the LDR documentation:
If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].
See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.
Basically, your code behaves like:
uint32_t v = *(uint32_t*)((char*)buffer + 0); /* aligned load of the underlying word */
v = (v >> 16) | (v << 16);                    /* rotate by 8 * (address bits [1:0]) = 16 bits */
debug("%x ", v); // prints 0134fe2a
but it has been encoded as a single instruction on the ARM. This behavior is dependent on the ARM CPU type and possibly co-processor values. It is also highly non-portable code.
It's called "undefined behavior". Your code is casting a value which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined behavior, which means pretty much anything can happen*.
In this case, the reason two of your examples behaved as you expected is because you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it was, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which it appears is what you were hoping for), or it could result in an aligned read (using the most significant 30 bits) and a rotation (word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the later is what happened in your 3rd example.
Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.
Basically, don't do that.
A safer alternative is to write the bytes into a uint32_t. Something like:
uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);
Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BIT == 8, but that's a much safer assumption. (I.e., it should work on pretty much any machine with 8-bit bytes.)
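If you also care about the byte order of the data on the card, a hedged sketch (the helper name is mine) that assembles the value byte by byte; this is alignment-safe and endianness-independent:
#include <stdint.h>

/* Read a 32-bit little-endian value from any, possibly unaligned, address. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}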

Pass two integers as one integer

I have two integers that I need to pass through one integer and then get the values of two integers back.
I am thinking of using logical operators (AND, OR, XOR, etc.).
Using the C programming language, it could be done as follows, assuming that each of the two integers fits in 16 bits (i.e. is at most 65535).
void take2IntegersAsOne(int x)
{
    // int1 is stored in the bottom half of x, so take just that part.
    int int1 = x & 0xFFFF;
    // int2 is stored in the top half of x, so slide that part of the number
    // into the bottom half, and take just that part.
    int int2 = (x >> 16) & 0xFFFF;
    // use int1 and int2 here. They must both be at most 0xFFFF (65535 in decimal).
}
void pass2()
{
    int int1 = 345;
    int int2 = 2342;
    take2IntegersAsOne( int1 | (int2 << 16) );
}
This relies on the fact that in C an int is (typically) stored in 4 bytes. So the example uses the first two bytes to store one of the integers, and the next two bytes for the second. This does impose the limit, though, that each of the integers must have a small enough value to fit into just 2 bytes.
The shift operators << and >> are used to slide the bits of an integer up and down. Shifting by 16 moves the bits by two bytes (as there are 8 bits per byte).
Using 0xFFFF represents the bit pattern where all of the bits in the lower two bytes of the number are 1s. So ANDing (with the & operator) switches all the bits that are not in these bottom two bytes back to zero. This can be used to remove any parts of the 'other integer' from the one you're currently extracting.
There are two parts to this question. First, how do you bitmask two 32-bit Integers into a 64-bit Long Integer?
As others have stated, let's say I have a function that takes an X and Y coordinate, and returns a longint representing that Point's linear value. I tend to call this linearization of 2d data:
public long asLong(int x, int y) {
    // mask y so that sign extension doesn't spill into the upper 32 bits
    return (((long) x) << 32) | (y & 0xFFFFFFFFL);
}
public int getX(long location) {
    return (int) ((location >> 32) & 0xFFFFFFFF);
}
public int getY(long location) {
    return (int) (location & 0xFFFFFFFF);
}
Forgive me if I'm paranoid about order of operations, sometimes other operations are greedier than <<, causing things to shift further than they should.
Why does this work? When might it fail?
It's convenient that integers tend to be exactly half the size of longints. What we're doing is casting x to a long, shifting it left until it sits entirely to the left of y, and then doing a union operation (OR) to combine the bits of both.
Let's pretend they're 4-bit numbers being combined into an 8-bit number:
x = 14 : 1110
y = 5 : 0101
x = x << 4 : 1110 0000
p = x | y : 1110 0000
OR 0101
---------
1110 0101
Meanwhile, the reverse:
p = 229 : 1110 0101
x = p >> 4 : 1111 1110 //depending on your language and data type, sign extension
//can cause the bits to smear on the left side as they're
//shifted, as shown here. Doesn't happen in unsigned types
x = x & 0xF:
1111 1110
AND 0000 1111
-------------
0000 1110 //AND selects only the bits we have in common
y = p & 0xF:
1110 0101
AND 0000 1111
-------------
0000 0101 //AND strikes again
This sort of approach came into being a long time ago, in environments that needed to squeeze every bit out of their storage or transmission space. If you're not on an embedded system or immediately packing this data for transmission over a network, the practicality of this whole procedure starts to break down really rapidly:
It's way too much work just for boxing a return value that almost always immediately needs to be unboxed and read by the caller. That's kind of like digging a hole and then filling it in.
It greatly reduces your code readability. "What type is returned?" Uh... an int.. and another int... in a long.
It can introduce hard-to-trace bugs down the line. For instance, if you use unsigned types and ignore the sign extension, then later on migrate to a platform that causes those types to go two's complement. If you save off the longint, and try to read it later in another part of your code, you might hit an off-by-one error on the bitshift and spend an hour debugging your function only to find out it's the parameter that's wrong.
If it's so bad, what are the alternatives?
This is why people were asking you about your language. Ideally, if you're in something like C or C++, it'd be best to say
struct Point { int x; int y; };
struct Point getPosition() {
    struct Point result = { 14, 5 };
    return result;
}
Otherwise, in HLLs like Java, you might wind up with an inner class to achieve the same functionality:
public class Example {
    public class Point {
        public int x;
        public int y;
        public Point(int x, int y) { this.x = x; this.y = y; }
    }
    public Point getPosition() {
        return new Point(14, 5);
    }
}
In this case, getPosition returns an Example.Point; if you keep using Point often, promote it to a full class of its own. In fact, java.awt has several point classes already, including Point and Point2D.Float.
Finally, many modern languages now have syntactic sugar for either boxing multiple values into tuples or directly returning multiple values from a function. This is kind of a last resort. In my experience, any time you pretend that data isn't what it is, you wind up with problems down the line. But if your method absolutely must return two numbers that really aren't part of the same data at all, tuples or arrays are the way to go.
The reference for the C++ standard library tuple can be found at
http://www.cplusplus.com/reference/std/tuple/
Well... @Felice is right, but if they both fit in 16 bits there's a way:
output_int = (first_int << 16) | second_int   /* '|' means bitwise OR */
to pack them, and
first_int = output_int & 0xffff
second_int = (output_int >> 16) & 0xffff      /* '&' means bitwise AND */
to extract them.
Two integers can't fit into one integer of the same size, or at least you can't get the two original ones back. But anyway, if the two original integers are bounded to a known number of bits, you can (in pseudocode):
First integer
OR with
(Second integer SHIFTLEFT(nOfBits))
To get back the two integers:
mask the merged integer with a number whose binary representation is nOfBits ones, and you obtain the first integer; then
shift the merged integer right by nOfBits, and you have back the second.
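A minimal C sketch of that pseudocode (the names and the 16-bit split are my choices; any nOfBits that bounds the second value works):
#define N_OF_BITS 16
#define LOW_MASK ((1u << N_OF_BITS) - 1u) /* N_OF_BITS ones in binary */

unsigned merge(unsigned first, unsigned second) {
    return first | (second << N_OF_BITS); /* first OR (second SHIFTLEFT nOfBits) */
}
unsigned get_first(unsigned merged)  { return merged & LOW_MASK; }
unsigned get_second(unsigned merged) { return merged >> N_OF_BITS; }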
You could store two 16-bit integers within one 32-bit integer: the first one in the first 16 bits and the second one in the last 16 bits. To retrieve and compose the values, you use the shift operators.

Linear interpolation on an 8-bit microcontroller

I need to do a linear interpolation over time between two values on an 8 bit PIC microcontroller (Specifically 16F627A but that shouldn't matter) using PIC assembly language. Although I'm looking for an algorithm here as much as actual code.
I need to take an 8 bit starting value, an 8 bit ending value and a position between the two (Currently represented as an 8 bit number 0-255 where 0 means the output should be the starting value and 255 means it should be the final value but that can change if there is a better way to represent this) and calculate the interpolated value.
Now, the PIC doesn't have a divide instruction, so I could code up a general-purpose divide routine and effectively calculate (B-A)/(x/255)+A at each step, but I feel there is probably a much better way to do this on a microcontroller than the way I'd do it on a PC in C++.
Has anyone got any suggestions for implementing this efficiently on this hardware?
The value you are looking for is (A*(255-x)+B*x)/255. It requires only 8x8 multiplication, and a final division by 255, which can be approximated by simply taking the high byte of the sum.
Choosing x in range 0..128, no approximation is needed: take the high byte of (A*(128-x)+B*x)<<1.
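In C, a hedged sketch of the first formula (the function name is mine; x runs 0..255, and the >> 8 is the take-the-high-byte approximation of dividing by 255):
#include <stdint.h>

uint8_t lerp255(uint8_t a, uint8_t b, uint8_t x)
{
    uint16_t acc = (uint16_t)(a * (255 - x) + b * x); /* fits: the weights sum to 255 */
    return (uint8_t)(acc >> 8); /* high byte, approximating /255 */
}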
Assuming you interpolate a sequence of values where the previous endpoint is the new start point:
(B-A)/(x/255)+A
sounds like a bad idea. If you use base 255 as a fixed-point representation, you get the same interpolant twice: you get B when x=255, and B again as the new A when x=0.
Use 256 as the fixed-point base. Divides become shifts, but you need 16-bit arithmetic and an 8x8 multiplication with a 16-bit result. The previous issue can be fixed by simply ignoring any bits in the higher bytes, as x mod 256 becomes 0. This suggestion uses 16-bit multiplication, but can't overflow, and you don't interpolate over the same x twice.
interp = (a*(256 - x) + b*x) >> 8
256 - x becomes just a subtract-with-borrow, as you get 0 - x.
The PIC lacks these operations in its instruction set:
Right and left shifts (both logical and arithmetic).
Any form of multiplication.
You can get right-shifting by using rotate-right instead, followed by masking out the extra bits on the left with bitwise-AND. A straightforward way to do 8x8 multiplication with a 16-bit result:
void mul16(
    unsigned char* hi, /* in: operand1, out: the most significant byte */
    unsigned char* lo  /* in: operand2, out: the least significant byte */
)
{
    unsigned char a, b;
    /* loop over the smaller value */
    a = (*hi <= *lo) ? *hi : *lo;
    b = (*hi <= *lo) ? *lo : *hi;
    *hi = *lo = 0;
    while (a) {
        *lo += b;
        if (*lo < b)  /* unsigned overflow; use the carry flag instead */
            (*hi)++;  /* note: *hi++ would increment the pointer, not the value */
        --a;
    }
}
The techniques described by Eric Bainville and Mads Elvheim will work fine; each one uses two multiplies per interpolation.
Scott Dattalo and Tony Kubek have put together a super-optimized PIC-specific interpolation technique called "twist" that is slightly faster than two multiplies per interpolation.
Is using this difficult-to-understand technique worth running a little faster?
You could do it using 8.8 fixed-point arithmetic. Then a number from the range 0..255 would be interpreted as 0.0 ... 0.996, and you would be able to multiply and normalize it.
Tell me if you need any more details or if it's enough for you to start.
You could characterize this instead as:
(B-A)*(256/(x+1))+A
using a value range of x = 0..255, precompute the values of 256/(x+1) as fixed-point numbers in a table, and then code a general-purpose multiply, adjusting for the position of the binary point. This might not be small space-wise; I'd expect you to need a 256-entry table of 16-bit values plus the multiply code. (If you don't need speed, this would suggest your division method is fine.) But it only takes one multiply and an add.
My guess is that you don't need every possible value of X. If there are only a few values of X, you can compute them offline, do a case-select on the specific value of X, and then implement the multiply as a fixed sequence of shifts and adds for that specific value of X, as sketched below. That's likely to be pretty efficient in code and very fast for a PIC.
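For instance (a hedged illustration, not from the answer), multiplying by a fixed constant such as 10 reduces to two shifts and an add, which maps directly onto PIC rotate-and-add sequences:
#include <stdint.h>

/* y = d * 10, using only shifts and adds: 10*d = 8*d + 2*d */
uint16_t times10(uint8_t d)
{
    return (uint16_t)((d << 3) + (d << 1));
}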
Interpolation
Given two values X & Y, it's basically:
(X+Y)/2
or
X/2 + Y/2 (to prevent the odd case where X+Y might overflow the size of the register)
Hence try the following:
(Pseudo-code)
Initially A = MAX, B = MIN
Loop {
    Right-shift A by 1 bit.
    Right-shift B by 1 bit.
    C = ADD the two results.
    Check the MSB of the 8-bit interpolation value:
        if MSB = 0, then B = C
        if MSB = 1, then A = C
    Left-shift the 8-bit interpolation value.
} Repeat until the 8-bit interpolation value becomes zero.
The actual code is just as easy; only I do not remember the registers and instructions off-hand.
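Here is a hedged C model of that loop (the names are mine), handy for checking the algorithm on a PC before writing the PIC assembly; note the repeated halving drops low bits, so the endpoints are only approximate:
#include <stdint.h>

uint8_t lerp_by_halving(uint8_t a, uint8_t b, uint8_t t)
{
    uint8_t c = a;
    for (int i = 0; i < 8; i++) {
        c = (uint8_t)((a >> 1) + (b >> 1)); /* C = A/2 + B/2, cannot overflow */
        if (t & 0x80) /* MSB of the interpolation value */
            a = c;    /* move the interval toward b */
        else
            b = c;    /* move the interval toward a */
        t <<= 1;      /* next bit */
    }
    return c;         /* near a when t == 0, near b when t == 255 */
}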

How do I detect overflow while multiplying two 2's complement integers?

I want to multiply two numbers, and detect if there was an overflow. What is the simplest way to do that?
Multiplying two 32-bit numbers results in a 64-bit answer, two 8-bit numbers give a 16-bit answer, etc. Binary multiplication is simply shifting and adding. So if you had, say, two 32-bit operands, with bit 17 set in operand A and any bit above 15 or 16 set in operand B, you will overflow a 32-bit result: bit 17 shifted left by 16 is bit 33, which does not fit in 32 bits.
So the question again is: what are the sizes of your inputs and of your result? If the result is the same size as the inputs, then you have to find the most significant 1 in both operands and add those bit positions; if that sum is bigger than your result space, you will overflow.
EDIT
Yes, multiplying two 3-bit numbers will result in either a 5-bit number, or a 6-bit number if there is a carry in the add. Likewise, a 2-bit and a 5-bit operand can result in 6 or 7 bits, etc. If the reason for the poster's question is to see whether you have space in your result variable for an answer, then this solution will work and is relatively fast for most languages on most processors. It can be significantly faster on some and significantly slower on others. It is generically fast (depending on how it is implemented, of course) to just look at the number of bits in the operands. Doubling the size of the largest operand is a safe bet if you can do it within your language or processor. Divides are downright expensive (slow), and most processors don't have one, much less at an arbitrary doubling of operand sizes. The fastest, of course, is to drop to assembler, do the multiply, and look at the overflow bit (or compare one of the result registers with zero). If your processor can't do the multiply in hardware, then it is going to be slow no matter what you do. I am guessing that asm is not the right answer to this post, despite being by far the fastest and having the most accurate overflow status.
Binary makes multiplication trivial compared to decimal. For example, take the binary numbers:
0b100 *
0b100
Just like the decimal math you learned in school, you (can) start with the least significant bit of the lower operand and multiply it against all the positions in the upper operand, except that with binary there are only two choices: you multiply by zero, meaning you don't have to add to the result, or you multiply by one, which means you just shift and add; no actual multiplication is necessary like you would have in decimal.
000 : 0 * 100
000 : 0 * 100
100 : 1 * 100
Add up the columns and the answer is 0b10000
Just as in decimal math, a 1 in the hundreds column means copy the top number and append two zeros; it works the same in any other base as well. So 0b100 times 0b110 is 0b1000 (a one in the second column: copy and append a zero) plus 0b10000 (a one in the third column: copy and append two zeros) = 0b11000.
This leads to looking at the most significant bits in both numbers. 0b1xx * 0b1xx guarantees that a 1xxxx is added to the answer, and that is the largest bit position in the add; no other single input to the final add has that column or a more significant column populated. From there you need only one more bit in case the other bits being added up cause a carry.
That happens in the worst case, all ones times all ones, 0b111 * 0b111:
0b00111 +
0b01110 +
0b11100
This causes a carry bit in the addition, resulting in 0b110001: 6 bits. A 3-bit operand times a 3-bit operand, 3+3 = 6, so 6 bits worst case.
So the size of the operands, measured from the most significant set bit (not the size of the registers holding the values), determines the worst-case storage requirement.
Well, that is true assuming positive operands. If you consider some of these numbers to be negative, it changes things, but not by much.
Minus 4 times 5, 0b1111...111100 * 0b0000....000101 = -20 or 0b1111..11101100
It takes 4 bits to represent a minus 4 and 4 bits to represent a positive 5 (don't forget your sign bit). Our result required 6 bits if you strip off all the redundant sign bits.
Let's look at the 4-bit corner cases:
-8 * 7 = -56
0b1000 * 0b0111 = 0b1001000
-1 * 7 = -7 = 0b1001
-8 * -8 = 64 = 0b01000000
-1 * -1 = 1 = 0b01
-1 * -8 = 8 = 0b01000
7 * 7 = 49 = 0b0110001
Let's say we count positive numbers as the position of the most significant 1 plus one, and negative numbers as the position of the most significant 0 plus one.
-8 * 7 is 4+4=8 bits, actual 7 bits
-1 * 7 is 1+4=5 bits, actual 4 bits
-8 * -8 is 4+4=8 bits, actual 8 bits
-1 * -1 is 1+1=2 bits, actual 2 bits
-1 * -8 is 1+4=5 bits, actual 5 bits
7 * 7 is 4+4=8 bits, actual 7 bits.
So this rule works (you can see that I called a minus one a single bit; for the "plus one" part, find the most significant zero and add one). Anyway, I argue that if this were a 4-bit * 4-bit machine as defined, you would have at least 4 bits of result, and I interpret the poster's question as: how many more than 4 bits do I need to safely store the answer? This rule serves to answer that question for 2's complement math.
If your question was to accurately determine overflow, with speed secondary, then it is going to be really, really slow for some systems, for every multiply you do. If this is the question you are asking, then to get some of the speed back you need to tune it a little better for the language and/or processor. Double up the biggest operand, if you can, and check for non-zero bits above the result size, or use a divide and compare. If you can't double the operand sizes, divide and compare. Check for zero before the divide.
Actually, your question doesn't specify what size of overflow you are talking about either. The good old 8086 multiplied 16 bits by 16 bits and gave a 32-bit result (in hardware); it can never overflow. What about some of the ARMs that have a multiply of 32 bits by 32 bits with a 32-bit result, easy to overflow? What is the size of your operands for this question: are they the same size as the result, or double the input size? Are you willing to perform multiplies that the hardware cannot do (without overflowing)? Are you writing a compiler library and trying to determine whether you can feed the operands to the hardware for speed, or whether you have to perform the math without a hardware multiply? That is the kind of thing you get if you cast up the operands: the compiler library will try to cast the operands back down before doing the multiply, depending on the compiler and its library, of course. And it will use the count-the-bits trick to determine whether to use the hardware multiply or a software one.
My goal here was to show how binary multiplication works in a digestible form, so you can see how much maximum storage you need by finding the position of a single bit in each operand. Now, how fast you can find that bit in each operand is the trick. If you were looking for minimum storage requirements rather than maximum, that is a different story, because it involves every single one of the significant bits in both operands, not just one bit per operand; you have to do the multiply to determine minimum storage. If you don't care about maximum or minimum storage, you have to just do the multiply and look for non-zero bits above your defined overflow limit, or use a divide if you have the time or hardware.
Your tags imply you are not interested in floating point. Floating point is a completely different beast; you cannot apply any of these fixed-point rules to floating point. They DO NOT work.
Check if one is less than a maximum value divided by the other. (All values are taken as absolute).
2's complementness hardly has anything to do with it, since the multiplication overflows if x*(2^n - x) > 2^M, which is equal to (x*2^n - x^2) > 2^M, or x^2 < (x*2^n - 2^M), so you'll have to compare overflowing numbers anyway (x^2 may overflow, while the result may not).
If your numbers are not of the largest integral data type, then you might just cast them up, multiply, and compare with the maximum of the numbers' original type. E.g. in Java, when multiplying two int values, you can cast them to long and compare the result to Integer.MAX_VALUE or Integer.MIN_VALUE (depending on the sign combination) before casting the result down to int.
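The same cast-up check, sketched in C for 32-bit operands (the function name is mine; C99 fixed-width types):
#include <stdint.h>

int mul_overflows_int32(int32_t x, int32_t y)
{
    int64_t p = (int64_t)x * (int64_t)y; /* exact in the wider type */
    return p > INT32_MAX || p < INT32_MIN; /* out of int32_t range? */
}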
If the type already is the largest, then check if one is less than the maximum value divided by the other. But do not take the absolute value! Instead you need separate comparison logic for each of the sign combinations neg*neg, pos*pos and pos*neg (neg*pos can obviously be reduced to pos*neg, and pos*pos might be reduced to neg*neg). First test for 0 arguments to allow safe divisions.
For actual code, see the Java source of the MathUtils class of commons-math 2, or ArithmeticUtils of commons-math 3. Look for public static long mulAndCheck(long a, long b). The case for positive a and b is:
// check for positive overflow with positive a, positive b
if (a <= Long.MAX_VALUE / b) {
    ret = a * b;
} else {
    throw new ArithmeticException(msg);
}
I want to multiply two (2's complement) numbers, and detect if there was an overflow. What is the simplest way to do that?
Various languages do not specify valid checking for overflow after it occurs, so prior tests are required.
With some types, a wider integer type may not exist, so a general solution should limit itself to a single type.
The below only requires compares and known limits of the integer range. It returns 1 if a product overflow will occur, else 0.
#include <limits.h>

int is_undefined_mult1(int a, int b) {
    if (a > 0) {
        if (b > 0) {
            return a > INT_MAX / b;       // a positive, b positive
        }
        return b < INT_MIN / a;           // a positive, b not positive
    }
    if (b > 0) {
        return a < INT_MIN / b;           // a not positive, b positive
    }
    return a != 0 && b < INT_MAX / a;     // a not positive, b not positive
}
Is this the simplest way?
Perhaps, yet it is complete and handles all the cases known to me, including rare non-2's-complement machines.
Alternatives to Pavel Shved's solution ...
If your language of choice is assembler, then you should be able to check the overflow flag. If not, you could write a custom assembler routine that sets a variable if the overflow flag was set.
If this is not acceptable, you can find the most significant set bit of both values (of their absolute values). If the sum of their positions exceeds the number of bits in the integer (or unsigned) type, then you will have an overflow when they are multiplied together.
Hope this helps.
In C, here's some maturely optimized code that handles the full range of corner cases (ABS, BITS, and significant_bits_uint are helpers from the linked full example; the definitions below are my reconstruction of them):
#include <limits.h>

#define ABS(x) ((x) < 0 ? -(x) : (x))
#define BITS(type) ((int)(sizeof(type) * CHAR_BIT))

/* position of the highest set bit, plus one (0 for v == 0) */
static int significant_bits_uint(unsigned v) {
    int n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

int
would_mul_exceed_int(int a, int b) {
    int product_bits;
    if (a == 0 || b == 0 || a == 1 || b == 1) return (0); /* always okay */
    if (a == INT_MIN || b == INT_MIN) return (1); /* always overflows: the other operand is at least 2 in magnitude */
    a = ABS(a);
    b = ABS(b);
    product_bits  = significant_bits_uint((unsigned)a);
    product_bits += significant_bits_uint((unsigned)b);
    if (product_bits == BITS(int)) { /* cases where the more expensive test is required */
        return (a > INT_MAX / b); /* remember that IDIV and similar are very slow (dozens to hundreds of cycles) compared to bit shifts and adds */
    }
    return (product_bits > BITS(int));
}
Full example with test cases here
The benefit of the above approach is that it doesn't require casting up to a larger type, so it can also work on the largest integer types.
