Can I set TIMx ARR value to initialize at chosen value?

So I am trying to use rotary encoders to control a menu on my STM32 project. I am using two rotary encoders to control each side of the screen (split menu).
When I initialize the ARR registers of both timers which are responsible for counting the encoder's pulses, it initializes the registers to a 0 value, and when I move the encoders counterclockwise the registers overflow and go to the maximum value of 65535, which messes with how my code calculates detents.
Can you guys tell me if there is any way to set the TIMx->CNT value to a custom value somewhere in the middle between 0 and 65535?
This way I could easily check differences between values and not worry about the jump in numbers.

when I move the encoders counterclockwise the registers overflow and go to the maximum value of 65535, which messes with how my code calculates detents.
The counter is a 16 bit value put into a 32 bit unsigned register. To get a proper signed value, cast it to int16_t.
int wheelposition = (int16_t)TIMx->CNT;
A value of 65535 (0xFFFF) would be sign-extended to 0xFFFFFFFF, which is interpreted as -1 in a 32 bit integer variable. But then you'd have the problem that it'd overflow from -32768 to +32767.
If you are interested in the signed difference of two position readings, you can do the subtraction on the unsigned values, and cast the result to int16_t.
uint32_t oldposition, newposition;
int wheelmovement;
oldposition = TIMx->CNT;
/* wait a bit */
newposition = TIMx->CNT;
wheelmovement = (int16_t)(newposition - oldposition);
It will give you the signed difference, with 16 bit overflow taken into account.
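The wraparound behaviour can be checked on a PC without any hardware. The following is a minimal sketch in which plain integers stand in for two TIMx->CNT readings; the function name encoder_delta is made up for this illustration.

```cpp
#include <cstdint>

// Signed movement between two readings of a 16-bit counter.
// The subtraction is done on unsigned values (it wraps instead of
// overflowing), and the low 16 bits of the result are then reinterpreted
// as a signed 16-bit number.
int encoder_delta(std::uint32_t old_count, std::uint32_t new_count) {
    return static_cast<std::int16_t>(new_count - old_count);
}
```

With this, a step from 0 down to 65535 comes out as -1 and a step from 65535 up to 2 comes out as +3, as long as the wheel moves fewer than 32768 counts between two readings.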
is there any way to set the TIMx->CNT value to a custom value somewhere in the middle between 0 and 65535?
You can simply assign any value to TIMx->CNT, it will continue counting from there.


Gathering half-float values using AVX

Using AVX/AVX2 intrinsics, I can gather sets of 8 values, either 1,2 or 4 byte integers, or 4 byte floats using:
But currently, I have a case where I am loading data that was generated on an nvidia GPU and stored as FP16 values. How can I do vectorized loads of these values?
So far, I found the _mm256_cvtph_ps() intrinsic.
However, input for that intrinsic is a __m128i value, not a __m256i value.
Looking at the Intel Intrinsics Guide, I see no gather operations that store 8 values into a __m128i register.
How can I gather FP16 values into the 8 lanes of a __m256 register? Is it possible to vector load them as 2-byte shorts into __m256i and then somehow reduce that to a __m128i value to be passed into the conversion intrinsic? If so, I haven't found intrinsics to do that.
I tried the cast as suggested by @peter-cordes but I am getting bogus results from that. Also, I don't understand how that could work.
My 2-byte int values are stored in __m256i as:
0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX
so how can I simply cast to __m128i where it needs to be tightly packed as
Will the cast do that?
My current code:
__fp16* fielddensity = ...
__m256i indices = ...
__m256i msk = _mm256_set1_epi32(0xffff);
__m256i d = _mm256_and_si256(_mm256_i32gather_epi32(fielddensity,indices,2), msk);
__m256 v = _mm256_cvtph_ps(_mm256_castsi256_si128(d));
But the result doesn't seem to be 8 properly formed values. I think every 2nd one is currently bogus for me?
There is indeed no gather instruction for 16bit values so you need to gather 32 bit values and ignore one half of them (and make sure that you don't accidentally read from invalid memory). Also, _mm256_cvtph_ps() needs all input values in the lower 128 bit lane and unfortunately, there is no lane-crossing 16 bit shuffle (until AVX512).
However, assuming you have only finite input values, you could do some bit-twiddling (avoiding the _mm256_cvtph_ps()). If you load a half precision value into the upper half of a 32 bit register you can do the following operations:
SEEEEEMM MMMMMMMM XXXXXXXX XXXXXXXX // input Sign, Exponent, Mantissa, X=garbage
Shift arithmetically to the right by 3 (this keeps the sign bit where it needs to be):
SSSSEEEE EMMMMMMM MMMXXXXX XXXXXXXX
Mask away excessive sign bits and garbage at the bottom (with 0b1000'11111'1111111111'0000000000000):
S000EEEE EMMMMMMM MMM00000 00000000
This will be a valid single precision float but the exponent will be off by 112=127-15 (the difference between the biases), i.e. you need to multiply these values by 2**112 (this may be combined with any subsequent operation, you intend to do anyway later). Note that this will also convert sub-normal float16 values to the corresponding sub-normal float32 value (which are also off by a factor of 2**112).
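To make the bit manipulation easier to follow, here is a scalar sketch of the same steps for a single finite FP16 value. The function name half_bits_to_float is made up for this illustration, and the input is passed as raw 16-bit storage. The arithmetic shift is emulated with a logical shift plus re-inserting the sign bit, which gives the same result after masking.

```cpp
#include <cstdint>
#include <cstring>

// Scalar model of the shift / mask / scale trick for one finite FP16 value.
float half_bits_to_float(std::uint16_t h) {
    // Place the 16 half-precision bits in the upper half of a 32-bit word.
    std::uint32_t w = static_cast<std::uint32_t>(h) << 16;
    // Shift right by 3 and put the sign bit back, then mask away the extra
    // sign bits and whatever sits below the mantissa.
    std::uint32_t bits = ((w >> 3) | (w & 0x80000000u)) & 0x8fffe000u;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    // Compensate for the exponent bias difference: multiply by 2^112.
    const std::uint32_t scale_bits = 0x77800000u;
    float scale;
    std::memcpy(&scale, &scale_bits, sizeof scale);
    return f * scale;
}
```

For example, 0x3C00 (1.0 in FP16) becomes the bit pattern 0x07800000, which is 2^-112 as a float, and the final multiplication brings it back to 1.0.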
Untested intrinsic version:
#include <immintrin.h>
#include <stdint.h>

__m256 gather_fp16(__fp16 const* fielddensity, __m256i indices){
    // subtract 2 bytes from base address to load data into high parts:
    int32_t const* base = (int32_t const*) ( fielddensity - 1);
    // Gather 32bit values.
    // Be aware that this reads two bytes before each desired value,
    // i.e., make sure that reading fielddensity[-1] is ok!
    __m256i d = _mm256_i32gather_epi32(base, indices, 2);
    // shift exponent bits to the right place and mask away excessive bits:
    d = _mm256_and_si256(_mm256_srai_epi32(d, 3), _mm256_set1_epi32(0x8fffe000));
    // scale values to compensate bias difference (could be combined with subsequent operations ...)
    __m256 two112 = _mm256_castsi256_ps(_mm256_set1_epi32(0x77800000)); // 2**112
    __m256 f = _mm256_mul_ps(_mm256_castsi256_ps(d), two112);
    return f;
}

Approach for determining max and min values of various types

I would like to know if my approach to determining the max and min values of the types is correct. I have googled around and could not find an exact methodology to determine this.
This is my approach :
To confirm the sizes of types I am using this link
Now the link states that the size of int_32_t (which by default is signed) will be 16 bits in LP32. So the max 16 bit number is 65535. But since it's signed we will get a max of 65535/2 = 32767.5, so I am assuming its range will be -32767 to 32767? Am I correct? And similarly for uint32_t the size will be 16 bits in LP32. So the max 16 bit number is 65535, so the range will be 0 to 65535? Am I correct? Also, what is the difference between LP32 and ILP32, and which one should I be following?
You don't have to make any assumptions. Just use the standard traits:
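The snippet this answer refers to is not shown here; presumably it means std::numeric_limits from the <limits> header. A minimal sketch under that assumption (print_range and print_common_ranges are names made up for this illustration):

```cpp
#include <cstdint>
#include <iostream>
#include <limits>

// Print the smallest and largest value of an integer type.
// The unary + promotes small types to int so they print as numbers.
template <typename T>
void print_range(const char* name) {
    std::cout << name << ": "
              << +std::numeric_limits<T>::min() << " .. "
              << +std::numeric_limits<T>::max() << '\n';
}

void print_common_ranges() {
    print_range<std::int16_t>("int16_t");
    print_range<std::uint16_t>("uint16_t");
    print_range<std::int32_t>("int32_t");
    print_range<std::uint32_t>("uint32_t");
}
```

Because the fixed-width types have the same width on every platform that provides them, these ranges do not depend on whether the data model is LP32 or ILP32.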
To address some of your points:
int_32_t (which by default is signed)
Not by default, but mandated by the language standard. It's a signed integer. The unsigned equivalent is std::uint32_t.
int32_t [...] will be 16 bits
Umm... nope. It's a signed integer type with a width of exactly 32 bits, with no padding bits, and using 2's complement for negative values (provided only if the implementation directly supports the type).
I am assuming its range will be -32767 to 32767? Am I correct ?
No. For a 16 bits signed integer using 2's complement (std::int16_t) the range is
-32,768 .. 32,767
Here is how you can reach these numbers:
The int type has 16 bits (and no padding). With 16 bits you can encode 2^16 = 65,536 distinct values. In two's complement these values are distributed as follows:
[0, 32,767]: 32,768 non-negative values
[-32,768, -1]: 32,768 negative values
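The same counting argument works for any width. As a small sketch (the function names are made up for this illustration, and the width is assumed to be between 1 and 63 bits):

```cpp
#include <cstdint>

// Range of an n-bit two's complement integer: half of the 2^n bit
// patterns encode negative values, the other half encode 0 and up.
constexpr std::int64_t twos_complement_min(int bits) {
    return -(std::int64_t{1} << (bits - 1));
}

constexpr std::int64_t twos_complement_max(int bits) {
    return (std::int64_t{1} << (bits - 1)) - 1;
}

static_assert(twos_complement_min(16) == -32768, "16-bit minimum");
static_assert(twos_complement_max(16) == 32767, "16-bit maximum");
```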

bit vector implementation of set in Programming Pearls, 2nd Edition

On Page 140 of Programming Pearls, 2nd Edition, Jon proposed an implementation of sets with bit vectors.
We'll turn now to two final structures that exploit the fact that our sets represent integers. Bit vectors are an old friend from Column 1. Here are their private data and functions:
enum { BITSPERWORD = 32, SHIFT = 5, MASK = 0x1F };
int n, hi, *x;
void set(int i) { x[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { x[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i) { return x[i>>SHIFT] & (1<<(i & MASK)); }
As I gathered, the central idea of a bit vector to represent an integer set, as described in Column 1, is that the i-th bit is turned on if and only if the integer i is in the set.
But I am really at a loss at the algorithms involved in the above three functions. And the book doesn't give an explanation.
I can only get that i & MASK is to get the lower 5 bits of i, while i>>SHIFT is to move i 5 bits toward the right.
Would anybody elaborate more on these algorithms? Bit operations always seem a mystery to me. :(
Bit Fields and You
I'll use a simple example to explain the basics. Say you have an unsigned integer with four bits:
[0][0][0][0] = 0
You can represent any number here from 0 to 15 by converting it to base 2. Say we have the right end be the smallest:
[0][1][0][1] = 5
So the first bit adds 1 to the total, the second adds 2, the third adds 4, and the fourth adds 8. For example, here's 8:
[1][0][0][0] = 8
So What?
Say you want to represent a binary state in an application-- if some option is enabled, if you should draw some element, and so on. You probably don't want to use an entire integer for each one of these- it'd be using a 32 bit integer to store one bit of information. Or, to continue our example in four bits:
[0][0][0][1] = 1 = ON
[0][0][0][0] = 0 = OFF //what a huge waste of space!
Of course, the problem is more pronounced in real life, since 32-bit integers look like this:
[0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0][0] = 0
The answer to this is to use a bit field. We have a collection of properties (usually related ones) which we will flip on and off using bit operations. So, say, you might have 4 different lights on a piece of hardware that you want to be on or off.
3 2 1 0
[0][0][0][0] = 0
(Why do we start with light 0? I'll explain this in a second.)
Note that this is an integer, and is stored as an integer, but is used to represent multiple states for multiple objects. Crazy! Say we turn lights 2 and 1 on:
3 2 1 0
[0][1][1][0] = 6
The important thing you should note here: There's probably no obvious reason why lights 2 and 1 being on should equal six, and it may not be obvious how we would do anything with this scheme of information storage. It doesn't look more obvious if you add more bits:
3 2 1 0
[1][1][1][0] = 0xE //what?
Why do we care about this? Do we have exactly one state for each number between 0 and 15? How are we going to manage this without some insane series of switch statements? Ugh...
The Light at the End
So if you've worked with binary arithmetic a bit before, you might realize that the relationship between the numbers on the left and the numbers on the right is, of course, base 2. That is:
1*(2^3) + 1*(2^2) + 1*(2^1) + 0*(2^0) = 0xE
So each light is present in the exponent of each term of the equation. If the light is on, there is a 1 next to its term- if the light is off, there is a zero. Take the time to convince yourself that there is exactly one integer between 0 and 15 that corresponds to each state in this numbering scheme.
Bit operators
Now that we have this done, let's take a second to see what bitshifting does to integers in this setup.
[0][0][0][1] = 1
When you shift bits to the left or the right in an integer, it literally moves the bits left and right. (Note: I 100% disavow this explanation for negative numbers! There be dragons!)
1<<2 = 4
[0][1][0][0] = 4
4>>1 = 2
[0][0][1][0] = 2
You will encounter similar behavior when shifting numbers represented with more than one bit. Also, it shouldn't be hard to convince yourself that x>>0 or x<<0 is just x. Doesn't shift anywhere.
This probably explains the naming scheme of the Shift operators to anyone who wasn't familiar with them.
Bitwise operations
This representation of numbers in binary can also be used to shed some light on the operations of bitwise operators on integers. Each bit in the first number is xor-ed, and-ed, or or-ed with its fellow number. Take a second to venture to wikipedia and familiarize yourself with the function of these Boolean operators - I'll explain how they function on numbers but I don't want to rehash the general idea in great detail.
Welcome back! Let's start by examining the effect of the OR (|) operator on two integers, stored in four bits.
[1][0][0][1] = 0x9
[1][1][0][0] = 0xC
[1][1][0][1] = 0xD
Tough! This is a close analogue to the truth table for the boolean OR operator. Notice that each column ignores the adjacent columns and simply fills in the result column with the result of the first bit and the second bit OR'd together. Note also that the value of anything or'd with 1 is 1 in that particular column. Anything or'd with zero remains the same.
The table for AND (&) is interesting, though somewhat inverted:
[1][0][0][1] = 0x9
[1][1][0][0] = 0xC
[1][0][0][0] = 0x8
In this case we do the same thing- we perform the AND operation with each bit in a column and put the result in that bit. No column cares about any other column.
Important lesson about this, which I invite you to verify by using the diagram above: anything AND-ed with zero is zero. Also, equally important- nothing happens to numbers that are AND-ed with one. They stay the same.
The final table, XOR, has behavior which I hope you all find predictable by now.
[1][0][0][1] = 0x9
[1][1][0][0] = 0xC
[0][1][0][1] = 0x5
Each bit is being XOR'd with its column, yadda yadda, and so on. But look closely at the first row and the second row. Which bits changed? (Half of them.) Which bits stayed the same? (No points for answering this one.)
The bit in the first row is being changed in the result if (and only if) the bit in the second row is 1!
The one lightbulb example!
So now we have an interesting set of tools we can use to flip individual bits. Let's go back to the lightbulb example and focus only on the first lightbulb.
[?] //We don't know if it's one or zero while coding
We know that we have an operation that can always make this bit equal to one- the OR 1 operator.
0|1 = 1
1|1 = 1
So, ignoring the rest of the bulbs, we could do this
4_bit_lightbulb_integer |= 1;
and know for sure that we did nothing but set the first lightbulb to ON.
3 2 1 0
[0][0][0][?] = 0 or 1? //4_bit_lightbulb_integer
[0][0][0][1] = 1
[0][0][0][1] = 0x1
Similarly, we can AND the number with zero. Well- not quite zero- we don't want to affect the state of the other bits, so we will fill them in with ones.
I'll use the unary (one-argument) operator for bit negation. The ~ (NOT) bitwise operator flips all of the bits in its argument. ~(0X1):
[0][0][0][1] = 0x1
[1][1][1][0] = 0xE
We will use this in conjunction with the AND bit below.
Let's do 4_bit_lightbulb_integer & 0xE
3 2 1 0
[0][1][0][?] = 4 or 5? //4_bit_lightbulb_integer
[1][1][1][0] = 0xE
[0][1][0][0] = 0x4
We're seeing a lot of integers on the right-hand-side which don't have any immediate relevance. You should get used to this if you deal with bit fields a lot. Look at the left-hand side. The bit on the right is always zero and the other bits are unchanged. We can turn off light 0 and ignore everything else!
Finally, you can use the XOR bit to flip the first bit selectively!
3 2 1 0
[0][1][0][?] = 4 or 5? //4_bit_lightbulb_integer
[0][0][0][1] = 0x1
[0][1][0][*] = 4 or 5?
We don't actually know what the value of * is now- just that flipped from whatever ? was.
Combining Bit Shifting and Bitwise operations
The interesting fact about these two operations is when taken together they allow you to manipulate selective bits.
[0][0][0][1] = 1 = 1<<0
[0][0][1][0] = 2 = 1<<1
[0][1][0][0] = 4 = 1<<2
[1][0][0][0] = 8 = 1<<3
Hmm. Interesting. I'll mention the negation operator here (~) as it's used in a similar way to produce the needed bit values for ANDing stuff in bit fields.
[1][1][1][0] = 0xE = ~(1<<0)
[1][1][0][1] = 0xD = ~(1<<1)
[1][0][1][1] = 0xB = ~(1<<2)
[0][1][1][1] = 0x7 = ~(1<<3)
Are you seeing an interesting relationship between the shift value and the corresponding lightbulb position of the shifted bit?
The canonical bitshift operators
As alluded to above, we have an interesting, generic method for turning on and off specific lights with the bit-shifters above.
To turn on a bulb, we generate the 1 in the right position using bit shifting, and then OR it with the current lightbulb positions. Say we want to turn on light 3, and ignore everything else. We need to get a bit shifting operation that ORs
3 2 1 0
[?][?][?][?] //all we know about these values at compile time is where they are!
and 0x8
[1][0][0][0] = 0x8
Which is easy, thanks to bitshifting! We'll pick the number of the light and switch the value over:
1<<3 = 0x8
and then:
4_bit_lightbulb_integer |= 0x8;
3 2 1 0
[1][?][?][?] //the ? marks have not changed!
And we can guarantee that the bit for the 3rd lightbulb is set to 1 and that nothing else has changed.
Clearing a bit works similarly- we'll use the negated bits table above to, say, clear light 2.
~(1<<2) = 0xB = [1][0][1][1]
4_bit_lightbulb_integer & 0xB:
3 2 1 0
The XOR method of flipping bits is the same idea as the OR one.
So the canonical methods of bit switching are this:
Turn on the light i:
Turn off light i:
Flip light i:
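The three operations follow directly from the OR, AND-with-negation and XOR tables above. A minimal sketch, using a variable called lights in place of 4_bit_lightbulb_integer (an identifier cannot start with a digit in C or C++) and helper names made up for this illustration:

```cpp
#include <cstdint>

// Turn on light i: OR with a single 1 in position i.
inline void turn_on(std::uint8_t& lights, unsigned i) {
    lights |= static_cast<std::uint8_t>(1u << i);
}

// Turn off light i: AND with all ones except a 0 in position i.
inline void turn_off(std::uint8_t& lights, unsigned i) {
    lights &= static_cast<std::uint8_t>(~(1u << i));
}

// Flip light i: XOR with a single 1 in position i.
inline void flip(std::uint8_t& lights, unsigned i) {
    lights ^= static_cast<std::uint8_t>(1u << i);
}
```

In every case the other bits of lights are left unchanged.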
Wait, how do I read these?
In order to check a bit we can simply zero out all of the bits except for the one we care about. We'll then check to see if the resulting value is greater than zero- since this is the only value that could possibly be nonzero, it will make the entire integer nonzero if and only if it is nonzero. For example, to check bit 2:
1<<2 & 4_bit_lightbulb_integer:
Remember from the previous examples that the value of ? didn't change. Remember also that anything AND 0 is 0. So, we can say for sure that if this value is greater than zero, the switch at position 2 is true and the lightbulb is on. Similarly, if the light is off, the value of the entire thing will be zero.
(You can alternately shift the entire value of 4_bit_lightbulb_integer over by i bits and AND it with 1. I don't remember off the top of my head if one is faster than the other but I doubt it.)
So the canonical checking function:
Check if bit i is on:
if (4_bit_lightbulb_integer & 1<<i) {
//do whatever
}
The specifics
Now that we have a complete set of tools for bitwise operations, we can look at the specific example here. This is basically the same idea- except a much more concise and powerful way of executing it. Let's look at this function:
void set(int i) { x[i>>SHIFT] |= (1<<(i & MASK)); }
From the canonical implementation I'm going to make a guess that this is trying to set some bits to 1! Let's take an integer and look at what's going on here if I feed the value 0x32 (50 in decimal) into i:
x[0x32>>5] |= (1<<(0x32 & 0x1f))
Well, that's a mess.. let's dissect this operation on the right. For convenience, pretend there are 24 more irrelevant zeros, since these are both 32 bit integers.
...[0][0][0][1][1][1][1][1] = 0x1F
...[0][0][1][1][0][0][1][0] = 0x32
...[0][0][0][1][0][0][1][0] = 0x12
It looks like everything is being cut off at the boundary on top where 1s turn into zeros. This technique is called Bit Masking. Interestingly, the boundary here restricts the resulting values to be between 0 and 31... Which is exactly the number of bit positions we have for a 32 bit integer!
x[0x32>>5] |= (1<<(0x12))
Let's look at the other half.
...[0][0][1][1][0][0][1][0] = 0x32
Shift five bits to the right:
...[0][0][0][0][0][0][0][1] = 0x01
Note that this transformation destroyed exactly the information used by the first part of the function: we have 32-5 = 27 remaining bits which could be nonzero. This indicates which of 2^27 integers in the array of integers is selected. So the simplified equation is now:
x[1] |= (1<<0x12)
This just looks like the canonical bit-setting operation! We've just chosen
So the idea is to use the first 27 bits to pick an integer to shift and the last five bits indicate which bit of the 32 in that integer to shift.
The key to understanding what's going on is to recognize that BITSPERWORD = 2^SHIFT. Thus, x[i>>SHIFT] finds which 32-bit element of the array x has the bit corresponding to i. (By shifting i 5 bits to the right, you're simply dividing by 32.) Once you have located the correct element of x, the lower 5 bits of i can then be used to find which particular bit of x[i>>SHIFT] corresponds to i. That's what i & MASK does; by shifting 1 by that number of bits, you move the bit corresponding to 1 to the exact position within x[i>>SHIFT] that corresponds to the ith bit in x.
Here's a bit more of an explanation:
Imagine that we want capacity for N bits in our bit vector. Since each int holds 32 bits, we will need (N + 31) / 32 int values for our storage (that is, N/32 rounded up). Within each int value, we will adopt the convention that bits are ordered from least significant to most significant. We will also adopt the convention that the first 32 bits of our vector are in x[0], the next 32 bits are in x[1], and so forth. Here's the memory layout we are using (showing the bit index in our bit vector corresponding to each bit of memory):
x[0]: | 31 | 30 | . . . | 02 | 01 | 00 |
x[1]: | 63 | 62 | . . . | 34 | 33 | 32 |
Our first step is to allocate the necessary storage capacity:
x = new int[(N + BITSPERWORD - 1) >> SHIFT];
(We could make provision for dynamically expanding this storage, but that would just add complexity to the explanation.)
Now suppose we want to access bit i (either to set it, clear it, or just to know its current value). We need to first figure out which element of x to use. Since there are 32 bits per int value, this is easy:
subscript for x = i / 32
Making use of the enum constants, the x element we want is:
x[i >> SHIFT]
(Think of this as a 32-bit-wide window into our N-bit vector.) Now we have to find the specific bit corresponding to i. Looking at the memory layout, it's not hard to figure out that the first (rightmost) bit in the window corresponds to bit index 32 * (i >> SHIFT). (The window starts after i >> SHIFT slots in x, and each slot has 32 bits.) Since that's the first bit in the window (position 0), the bit we're interested in is at position
i - (32 * (i >> SHIFT))
in the window. With a little experimenting, you can convince yourself that this expression is always equal to i % 32 (actually, that's one definition of the mod operator) which, for non-negative i, is always equal to i & MASK. Since this last expression is the fastest way to calculate what we want, that's what we'll use.
From here, the rest is pretty simple. We start with a single bit in the least-significant position of the window (that is, the constant 1), and move it to the left by i & MASK bits to get it to the position in the window corresponding to bit i in the bit vector. This is where the expression
1 << (i & MASK)
comes from. With the bit now moved to where we want it, we can use this as a mask to set, clear, or query the value of the bit at that position in x[i>>SHIFT] and we know that we're actually setting, clearing, or querying the value of bit i in our bit vector.
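Putting the pieces together, here is a self-contained sketch of such a bit vector. It is not Bentley's code: it uses std::vector and unsigned 32-bit words instead of a raw int array, and the class name BitVector and the word() accessor are made up for this illustration.

```cpp
#include <cstdint>
#include <vector>

class BitVector {
    enum { BITSPERWORD = 32, SHIFT = 5, MASK = 0x1F };
    std::vector<std::uint32_t> x;  // x[0] holds bits 0..31, x[1] bits 32..63, ...

public:
    // Allocate enough 32-bit words for n bits (n / 32 rounded up), all zero.
    explicit BitVector(int n) : x((n + BITSPERWORD - 1) >> SHIFT, 0u) {}

    // i >> SHIFT selects the word, i & MASK the bit inside that word.
    void set(int i) { x[i >> SHIFT] |= (1u << (i & MASK)); }
    void clr(int i) { x[i >> SHIFT] &= ~(1u << (i & MASK)); }
    bool test(int i) const { return (x[i >> SHIFT] & (1u << (i & MASK))) != 0; }

    // Raw access to one word, only to make the layout visible.
    std::uint32_t word(int w) const { return x[w]; }
};
```

Setting bit 50 in a 100-bit vector touches word 50 >> 5 = 1 and bit 50 & 31 = 18, so afterwards word(1) equals 1 << 18.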
If you store your bits in an array of n words you can imagine them to be laid out as a matrix with n rows and 32 columns (BITSPERWORD):
3 0
1 0
0 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
1 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
2 xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
n xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx
To get the k-th bit you divide k by 32. The (integer) result will give you the row (word) the bit is in, and the remainder will give you which bit it is within the word.
Dividing by 2^p can be done simply by shifting p positions to the right. The remainder can be obtained by getting the p rightmost bits (i.e. the bitwise AND with (2^p - 1)).
In C terms:
#define div32(k) ((k) >> 5)
#define mod32(k) ((k) & 31)
#define word_the_bit_is_in(k) div32(k)
#define bit_within_word(k) mod32(k)
Hope it helps.

linear interpolation on 8bit microcontroller

I need to do a linear interpolation over time between two values on an 8 bit PIC microcontroller (Specifically 16F627A but that shouldn't matter) using PIC assembly language. Although I'm looking for an algorithm here as much as actual code.
I need to take an 8 bit starting value, an 8 bit ending value and a position between the two (Currently represented as an 8 bit number 0-255 where 0 means the output should be the starting value and 255 means it should be the final value but that can change if there is a better way to represent this) and calculate the interpolated value.
Now PIC doesn't have a divide instruction so I could code up a general purpose divide routine and effectively calculate (B-A)*x/255+A at each step, but I feel there is probably a much better way to do this on a microcontroller than the way I'd do it on a PC in C++.
Has anyone got any suggestions for implementing this efficiently on this hardware?
The value you are looking for is (A*(255-x)+B*x)/255. It requires only 8x8 multiplication, and a final division by 255, which can be approximated by simply taking the high byte of the sum.
Choosing x in range 0..128, no approximation is needed: take the high byte of (A*(128-x)+B*x)<<1.
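As a sketch of the 0..128 variant in C++ rather than PIC assembly (the function name lerp128 is made up for this illustration, and x is assumed to stay within 0..128):

```cpp
#include <cstdint>

// Interpolate between a and b with x in 0..128.
// x = 0 returns a exactly, x = 128 returns b exactly.
std::uint8_t lerp128(std::uint8_t a, std::uint8_t b, std::uint8_t x) {
    // At most 255 * 128 = 32640, so the sum fits in 16 bits.
    std::uint16_t sum = static_cast<std::uint16_t>(a * (128 - x) + b * x);
    // High byte of (sum << 1), which is the same as sum / 128.
    return static_cast<std::uint8_t>((sum << 1) >> 8);
}
```

The two products are the 8x8 multiplications mentioned above; on the PIC they would have to come from a software multiply routine.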
Assuming you interpolate a sequence of values where the previous endpoint is the new start point:
sounds like a bad idea. If you use base 255 as a fixedpoint representation, you get the same interpolant twice. You get B when x=255 and B as the new A when x=0.
Use 256 as the fixedpoint system. Divides become shifts, but you need 16-bit arithmetic and 8x8 multiplication with a 16-bit result. The previous issue can be fixed by simply ignoring any bits in the higher-bytes as x mod 256 becomes 0. This suggestion uses 16-bit multiplication, but can't overflow, and you don't interpolate over the same x twice.
interp = (a*(256 - x) + b*x) >> 8
256 - x becomes just a subtract-with-borrow, as you get 0 - x.
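The formula can be tried out on a PC first. A minimal sketch, with lerp256 as a made-up name and x taken as an 8-bit position:

```cpp
#include <cstdint>

// Interpolate between a and b with x in 0..255, using 256 as the scale.
// x = 0 returns a; x = 255 stops just short of b, and b itself is reached
// as the start value (x = 0) of the next segment.
std::uint8_t lerp256(std::uint8_t a, std::uint8_t b, std::uint8_t x) {
    // The weights (256 - x) and x add up to 256, so the sum is at most
    // 255 * 256 = 65280 and fits in 16 bits.
    std::uint16_t sum = static_cast<std::uint16_t>(a * (256 - x) + b * x);
    return static_cast<std::uint8_t>(sum >> 8);  // keep the high byte
}
```

For example, lerp256(10, 200, 255) gives 199 and lerp256(0, 255, 128) gives 127, since the shift truncates rather than rounds.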
The PIC lacks these operations in its instruction set:
Right and left shift. (both logical and arithmetic)
Any form of multiplication.
You can get right-shifting by using rotate-right instead, followed by masking out the extra bits on the left with bitwise-and. A straight-forward way to do 8x8 multiplication with 16-bit result:
void mul16(
    unsigned char* hi, /* in: operand1, out: the most significant byte */
    unsigned char* lo  /* in: operand2, out: the least significant byte */
)
{
    unsigned char a, b;
    /* loop over the smallest value */
    a = (*hi <= *lo) ? *hi : *lo;
    b = (*hi <= *lo) ? *lo : *hi;
    *hi = *lo = 0;
    /* add b to the 16-bit result a times */
    while (a--) {
        *lo += b;
        if(*lo < b) /* unsigned overflow. Use the carry flag instead.*/
            ++*hi;
    }
}
The techniques described by Eric Bainville and Mads Elvheim will work fine; each one uses two multiplies per interpolation.
Scott Dattalo and Tony Kubek have put together a super-optimized PIC-specific interpolation technique called "twist" that is slightly faster than two multiplies per interpolation.
Is using this difficult-to-understand technique worth running a little faster?
You could do it using 8.8 fixed-point arithmetic. Then a number from range 0..255 would be interpreted as 0.0 ... 0.996 and you would be able to multiply and normalize it.
Tell me if you need any more details or if it's enough for you to start.
You could characterize this instead as:
using a value range of x=0..255, precompute the values of 256/(x+1) as a fixed-point number in a table, and then code a general purpose multiply, adjusting for the position of the binary point. This might not be small spacewise; I'd expect you to need a 256 entry table of 16 bit values and the multiply code. (If you don't need speed, this would suggest your division method is fine.) But it only takes one multiply and an add.
My guess is that you don't need every possible value of X. If there are only a few values of X, you can compute them offline, do a case-select on the specific value of X and then implement the multiply in terms of a fixed sequence of shifts and adds for the specific value of X. That's likely to be pretty efficient in code and very fast for a PIC.
Given two values X & Y , its basically:
X/2 + Y/2 (to prevent the odd-case that A+B might overflow the size of the register)
Hence try the following:
Initially A=MAX, B=MIN
Loop {
Right-Shift A by 1-bit.
Right-Shift B by 1-bit.
C = ADD the two results.
Check MSB of 8-bit interpolation value
if MSB=0, then B=C
if MSB=1, then A=C
Left-Shift 8-bit interpolation value
}Repeat until 8-bit interpolation value becomes zero.
The actual code is just as easy. Only I do not remember the registers and instructions off-hand.

How do I detect overflow while multiplying two 2's complement integers?

I want to multiply two numbers, and detect if there was an overflow. What is the simplest way to do that?
Multiplying two 32 bit numbers results in a 64 bit answer, two 8s give a 16, etc. Binary multiplication is simply shifting and adding. So if you had, say, two 32 bit operands with bit 17 set in operand A and any of the bits above 15 or 16 set in operand B, you will overflow a 32 bit result: bit 17 shifted left 16 is bit 33 added to a 32.
So the question again is what are the sizes of your inputs and the size of your result. If the result is the same size, then you have to find the most significant 1 of both operands and add those bit locations; if that sum is bigger than your result space, you will overflow.
Yes, multiplying two 3 bit numbers will result in either a 5 bit number or a 6 bit number if there is a carry in the add. Likewise a 2 bit and a 5 bit can result in 6 or 7 bits, etc. If the reason for the poster's question is to see if you have space in your result variable for an answer, then this solution will work and is relatively fast for most languages on most processors. It can be significantly faster on some and significantly slower on others. It is generically fast (depending on how it is implemented, of course) to just look at the number of bits in the operands. Doubling the size of the largest operand is a safe bet if you can do it within your language or processor. Divides are downright expensive (slow), and most processors don't have one, much less one at an arbitrary doubling of operand sizes. The fastest, of course, is to drop to assembler, do the multiply and look at the overflow bit (or compare one of the result registers with zero). If your processor can't do the multiply in hardware then it is going to be slow no matter what you do. I am guessing that asm is not the right answer to this post, despite being by far the fastest and having the most accurate overflow status.
binary makes multiplication trivial compared to decimal, for example take the binary numbers
0b100 *
0b100
  000 : 0 * 100
 000  : 0 * 100
100   : 1 * 100
Add up the columns and the answer is 0b10000
Same as decimal math a 1 in the hundreds column means copy the top number and add two zeros, it works the same in any other base as well. So 0b100 times 0b110 is 0b1000, a one in the second column over so copy and add a zero + 0b10000 a one in the third column over so copy and add two zeros = 0b11000.
This leads to looking at the most significant bits in both numbers. 0b1xx * 0b1xx guarantees a 1xxxx is added to the answer, and that is the largest bit location in the add; no other single input to the final add has that column populated or a more significant column populated. From there you need only one more bit in case the other bits being added up cause a carry.
Which happens with the worst case all ones times all ones, 0b111 * 0b111
0b00111 +
0b01110 +
0b11100
This causes a carry bit in the addition resulting in 0b110001. 6 bits. a 3 bit operand times a 3 bit operand 3+3=6 6 bits worst case.
So size of the operands using the most significant bit (not the size of the registers holding the values) determines the worst case storage requirement.
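For unsigned operands, the bit-counting rule can be sketched as follows. The function names are made up for this illustration, and the check is deliberately conservative: it may say "might not fit" for a product that actually fits, because it reserves the extra bit for a possible carry.

```cpp
#include <cstdint>

// Position of the most significant 1 bit, counting from 1 (0 for the value 0).
int significant_bits(std::uint32_t v) {
    int n = 0;
    while (v != 0) {
        ++n;
        v >>= 1;
    }
    return n;
}

// True means the product is guaranteed to fit in 32 bits.
// False means it might overflow.
bool product_surely_fits_u32(std::uint32_t a, std::uint32_t b) {
    return significant_bits(a) + significant_bits(b) <= 32;
}
```

The 3 bit example from above shows the worst case: 7 * 7 = 49 has 3 + 3 = 6 significant bits.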
Well, that is true assuming positive operands. If you consider some of these numbers to be negative it changes things but not by much.
Minus 4 times 5, 0b1111...111100 * 0b0000....000101 = -20 or 0b1111..11101100
It takes 4 bits to represent a minus 4 and 4 bits to represent a positive 5 (don't forget your sign bit). Our result required 6 bits if you stripped off all the sign bits.
Let's look at the 4 bit corner cases
-8 * 7 = -56
0b1000 * 0b0111 = 0b1001000
-1 * 7 = -7 = 0b1001
-8 * -8 = 64 = 0b01000000
-1 * -1 = 1 = 0b01
-1 * -8 = 8 = 0b01000
7 * 7 = 49 = 0b0110001
Let's say we count positive numbers as the most significant 1 plus one and negative the most significant 0 plus one.
-8 * 7 is 4+4=8 bits, actual 7 bits
-1 * 7 is 1+4=5 bits, actual 4 bits
-8 * -8 is 4+4=8 bits, actual 8 bits
-1 * -1 is 1+1=2 bits, actual 2 bits
-1 * -8 is 1+4=5 bits, actual 5 bits
7 * 7 is 4+4=8 bits, actual 7 bits.
So this rule works for all of these cases. You can see that I called a minus one one bit: there is no zero to find, so only the plus one remains. Anyway, I argue that if this were a 4 bit * 4 bit machine as defined, you would have at least 4 bits of result, and I interpret the question as how many more than 4 bits do I need to safely store the answer. So this rule serves to answer that question for 2's complement math.
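The counting rule for signed operands (most significant 1 plus one for non-negative values, most significant 0 plus one for negative values) could be written along these lines. The function name signed_bits and the use of int32_t are assumptions for the sake of the example.

```c
#include <stdint.h>

/* Bits needed to hold v in two's complement, including the sign bit. */
unsigned signed_bits(int32_t v) {
    /* For negative v, invert the bits so that the most significant 0
       becomes the most significant 1 and one loop handles both signs. */
    uint32_t u = (v < 0) ? ~(uint32_t)v : (uint32_t)v;
    unsigned n = 1;                /* the "plus one" for the sign bit */
    while (u != 0) {
        n++;
        u >>= 1;
    }
    return n;
}
```

With this, signed_bits(-8) + signed_bits(7) gives 8 as the estimate for -8 * 7, while signed_bits(-56) reports the 7 bits actually used.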
If your question was to accurately determine overflow, with speed secondary, then, well, it is going to be really, really slow on some systems, for every multiply you do. If this is the question you are asking, to get some of the speed back you need to tune it a little better for the language and/or processor. Double up the biggest operand, if you can, and check for non-zero bits above the result size, or use a divide and compare. If you can't double the operand sizes, divide and compare. Check for zero before the divide.
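Where a type twice as wide is available, the "double up the operand" approach might look like the sketch below for 32 bit signed operands. The function name mul_overflows_i32 and the out-parameter convention are choices made for this example only.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if a * b does not fit in int32_t.
   Otherwise stores the product in *out and returns false. */
bool mul_overflows_i32(int32_t a, int32_t b, int32_t *out) {
    int64_t wide = (int64_t)a * (int64_t)b;   /* cannot overflow in 64 bits */
    if (wide > INT32_MAX || wide < INT32_MIN) {
        return true;
    }
    *out = (int32_t)wide;
    return false;
}
```

The range comparison plays the role of checking for non-zero bits above the result size, and it also handles negative products.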
Actually, your question doesn't specify what size of overflow you are talking about either. Good old 8086 16 bit times 16 bit gives a 32 bit result (hardware), so it can never overflow. What about some of the ARMs that have a multiply, 32 bit times 32 bit, 32 bit result? Easy to overflow. What is the size of your operands for this question, are they the same size or are they double the input size? Are you willing to perform multiplies that the hardware cannot do (without overflowing)? Are you writing a compiler library and trying to determine whether you can feed the operands to the hardware for speed or have to perform the math without a hardware multiply? That is the kind of thing you get if you cast up the operands: the compiler library will try to cast the operands back down before doing the multiply, depending on the compiler and its library of course. And it will use the count-the-bits trick to determine whether to use the hardware multiply or a software one.
My goal here was to show how binary multiply works in a digestible form, so you can see how much maximum storage you need by finding the location of a single bit in each operand. Now, how fast you can find that bit in each operand is the trick. If you were looking for minimum storage requirements, not maximum, that is a different story, because it involves every single one of the significant bits in both operands, not just one bit per operand; you have to do the multiply to determine minimum storage. If you don't care about maximum or minimum storage, you just have to do the multiply and look for non-zero bits above your defined overflow limit, or use a divide if you have the time or hardware.
Your tags imply you are not interested in floating point. Floating point is a completely different beast: you cannot apply any of these fixed point rules to floating point, they DO NOT work.
Check if one is less than a maximum value divided by the other. (All values are taken as absolute).
2's complementness hardly has anything to do with it, since the multiplication overflows if x*(2^n - x) > 2^M, which is equal to (x*2^n - x^2) > 2^M, or x^2 < (x*2^n - 2^M), so you'll have to compare overflowing numbers anyway (x^2 may overflow, while the result may not).
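For values that are already non-negative (or have been made absolute), the divide-and-compare check can be sketched for unsigned int as below. The name umul_would_wrap is hypothetical, and this covers only the unsigned case, not the signed corner cases.

```c
#include <limits.h>
#include <stdbool.h>

/* True if a * b would not fit in an unsigned int. */
bool umul_would_wrap(unsigned a, unsigned b) {
    if (a == 0 || b == 0) {
        return false;              /* product is 0; also avoids dividing by zero */
    }
    return a > UINT_MAX / b;       /* the division itself can never overflow */
}
```

The comparison is done on the quotient, so nothing in the check can overflow.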
If your numbers are not from the largest integral data type, then you might just cast them up, multiply and compare with the maximum of the numbers' original type. E.g. in Java, when multiplying two int, you can cast them to long and compare the result to Integer.MAX_VALUE or Integer.MIN_VALUE (depending on sign combination), before casting the result down to int.
If the type already is the largest, then check if one is less than the maximum value divided by the other. But do not take the absolute value! Instead you need separate comparison logic for each of the sign combinations neg*neg, pos*pos and pos*neg (neg*pos can obviously be reduced to pos*neg, and pos*pos might be reduced to neg*neg). First test for 0 arguments to allow safe divisions.
For actual code, see the Java source of MathUtils class of the commons-math 2, or ArithmeticUtils of commons-math 3. Look for public static long mulAndCheck(long a, long b). The case for positive a and b is
// check for positive overflow with positive a, positive b
if (a <= Long.MAX_VALUE / b) {
    ret = a * b;
} else {
    throw new ArithmeticException(msg);
}
I want to multiply two (2's complement) numbers, and detect if there was an overflow. What is the simplest way to do that?
Various languages do not specify valid checking for overflow after it occurs and so prior tests are required.
With some types, a wider integer type may not exist, so a general solution should limit itself to a single type.
The below (Ref) only requires compares and known limits to the integer range. It returns 1 if a product overflow will occur, else 0.
#include <limits.h>

int is_undefined_mult1(int a, int b) {
  if (a > 0) {
    if (b > 0) {
      return a > INT_MAX / b;       // a positive, b positive
    }
    return b < INT_MIN / a;         // a positive, b not positive
  }
  if (b > 0) {
    return a < INT_MIN / b;         // a not positive, b positive
  }
  return a != 0 && b < INT_MAX / a; // a not positive, b not positive
}
Is this the simplest way?
Perhaps, yet it is complete and handles all cases known to me, including rare non-2's complement.
Alternatives to Pavel Shved's solution ...
If your language of choice is assembler, then you should be able to check the overflow flag. If not, you could write a custom assembler routine that sets a variable if the overflow flag was set.
If this is not acceptable, you can find the most significant set bit of both values (absolutes). If the sum of the two bit positions exceeds the number of bits in the integer (or unsigned) then you will have an overflow if they are multiplied together.
Hope this helps.
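If writing assembler is not an option but the compiler is GCC or Clang, a similar effect to reading the overflow flag can be had through the __builtin_mul_overflow builtin. This is a compiler extension rather than standard C, so treat the sketch below as an assumption about the toolchain. The wrapper name checked_mul_int is made up for the example.

```c
#include <stdbool.h>

/* Assumes GCC or Clang: __builtin_mul_overflow is a compiler extension.
   Returns true if a * b fits in int, with the product stored in *out.
   On failure *out holds the wrapped value and should be ignored. */
bool checked_mul_int(int a, int b, int *out) {
    return !__builtin_mul_overflow(a, b, out);
}
```

On targets with a hardware overflow flag the compiler is free to use it here, but that is up to the compiler.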
In C, here's some maturely optimized code that handles the full range of corner cases:
/* ABS, BITS and significant_bits_uint are helpers not shown in this excerpt */
int
would_mul_exceed_int(int a, int b) {
  int product_bits;

  if (a == 0 || b == 0 || a == 1 || b == 1) return (0); /* always okay */
  if (a == INT_MIN || b == INT_MIN) return (1); /* always underflow */

  a = ABS(a);
  b = ABS(b);

  product_bits  = significant_bits_uint((unsigned)a);
  product_bits += significant_bits_uint((unsigned)b);

  if (product_bits == BITS(int)) { /* cases where the more expensive test is required */
    return (a > INT_MAX / b); /* remember that IDIV and similar are very slow (dozens - hundreds of cycles) compared to bit shifts, adds */
  }
  return (product_bits > BITS(int));
}
Full example with test cases here
The benefit of the above approach is it doesn't require casting up to a larger type, so the approach could work on larger integer types.
