Fast saturating integer conversion?

I'm wondering if there is any fast bit-twiddling trick to do a saturating conversion from a 64-bit unsigned value to a 32-bit unsigned value (it would be nice if it generalized to other widths, but that's the main width I care about). Most of the resources I've been able to find by googling have been about saturating arithmetic operations.
A saturating conversion would take a 64-bit unsigned value and return either the value unmodified as a 32-bit value, or 2^32-1 if the input value is greater than 2^32-1. Note that this is not what the default truncating behaviour of a C cast does.
I can imagine doing something like:
Test if upper half has any bit set
If so create a 32-bit mask with all bits set, otherwise create a mask with all bits unset
Bitwise-or lower half with mask
But I don't know how to quickly generate the mask. I tried straightforward branching implementations in Godbolt to see if the compiler would generate a clever branchless implementation for me but no luck.
Implementation example:
#include <stdint.h>
#include <limits.h>
uint32_t saturate(uint64_t num) {
    return num > UINT32_MAX ? UINT32_MAX : num;
}
Edit: my mistake, the issue was that Godbolt was not set to use optimizations.

You don't need any fancy bit-twiddling trick to do this. The following function should be enough for compilers to generate efficient code:
uint32_t saturate(uint64_t value) {
    return value > UINT32_MAX ? UINT32_MAX : value;
}
This contains a conditional statement, but most common CPUs, like AMD/Intel and Arm ones, have conditional move instructions. So they will test the value for overflowing 32 bits and, based on the test, either replace it with UINT32_MAX or leave it alone. For example, on 64-bit Arm processors this function is compiled by GCC to:
saturate:
        mov     x1, 4294967295
        cmp     x0, x1
        csel    x0, x0, x1, ls
        ret
Note that you must enable compiler optimizations to get the above result.

A way to do this without relying on conditional moves is
((-(x >> 32)) | (x << 32)) >> 32
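To see why this works: if any of the upper 32 bits of x are set, -(x >> 32) wraps to a value whose upper 32 bits are all ones, so the final shift produces UINT32_MAX; otherwise the negation is zero and the low half of x, parked in the upper 32 bits by x << 32, passes through unchanged. A minimal sketch wrapping it in a function (name hypothetical):

#include <stdint.h>

uint32_t saturate_mask(uint64_t x) {
    /* -(x >> 32) has all of its top 32 bits set iff the upper half of x is nonzero */
    return (uint32_t)((-(x >> 32) | (x << 32)) >> 32);
}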

Related

Mathematically find the value that is closest to 0

Is there a way to determine mathematically if a value is closer to 0 than another?
For example closerToZero(-2, 3) would return -2.
I tried removing the sign and then comparing the values for the minimum, but then I would be assigning the sign-less version of the initial numbers.
a and b are IEEE-754 compliant floating-point doubles (js number)
(64 bit => 1 bit sign 11 bit exponent 52 bit fraction)
min (a,b) => b-((a-b)&((a-b)>>52));
result = min(abs(a), abs(b));
// result has the wrong sign ...
The obvious algorithm is to compare the absolute values, and use that to select the original values.
If this absolutely needs to be branchless (e.g. for crypto security), be careful with ? : ternary. It often compiles to branchless asm but that's not guaranteed. (I assume that's why you tagged branch-prediction? If it was just out of performance concerns, the compiler will generally make good decisions.)
In languages with fixed-width 2's complement integers, remember that abs(INT_MIN) overflows a signed result of the same width as the input. In C and C++, abs() is inconveniently designed to return an int, and it's undefined behaviour to call it with the most-negative 2's complement integer. On systems with well-defined wrapping signed-int math (like gcc -fwrapv, or Java), signed abs(INT_MIN) would overflow back to INT_MIN, giving wrong results with a signed compare because INT_MIN is maximally far from 0.
Make sure you do an unsigned compare of the abs results so you correctly handle INT_MIN. (Or as #kaya3 suggests, map positive integers to negative, instead of negative to positive.)
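A quick sketch of that alternative (function names hypothetical): mapping into the negative range never overflows, because -INT_MAX is representable while -INT_MIN is not, and a plain signed compare then works.

int negabs(int x) {            /* always 0 or negative; INT_MIN maps to itself safely */
    return x < 0 ? x : -x;
}

int minabs_neg(int a, int b) { /* more negative means larger magnitude */
    return negabs(a) > negabs(b) ? a : b;
}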
Safe C implementation that avoids Undefined Behaviour:
unsigned absu(int x) {
    return x < 0 ? 0U - x : x;
}

int minabs(int a, int b) {
    return absu(a) < absu(b) ? a : b;
}
Note that < vs. <= actually matters in minabs: that decides which one to select if their magnitudes are equal.
0U - x converts x to unsigned before the subtraction from 0, so the negation wraps instead of overflowing a signed type. Converting negative signed-integer values to unsigned is well-defined in C and C++ as modulo reduction (unlike converting out-of-range floats, which is UB, IIRC). On 2's complement machines that means the same bit-pattern is used unchanged.
This compiles nicely for x86-64 (Godbolt), especially with clang. (GCC avoids cmov even with -march=skylake, ending up with a worse sequence, except for the final select after doing both absu operations; there it uses cmovbe, which is 2 uops instead of 1 for cmovb on Intel CPUs because it needs to read both ZF and CF flags. If it had ended up with the opposite value in EAX already, it could have used cmovb.)
# clang -O3
absu:
        mov     eax, edi
        neg     eax             # sets flags like a subtract from 0
        cmovl   eax, edi        # select on signed less-than condition
        ret
minabs:
        mov     ecx, edi
        neg     ecx
        cmovl   ecx, edi        # inlined absu(a)
        mov     eax, esi
        mov     edx, esi
        neg     edx
        cmovl   edx, esi        # inlined absu(b)
        cmp     ecx, edx        # compare the absu results
        cmovb   eax, edi        # select on unsigned Below condition
        ret
Fully branchless with both GCC and clang, with optimization enabled. It's a safe bet that other ISAs will be the same.
It might auto-vectorize decently, but x86 doesn't have SIMD unsigned integer compares until AVX512. (You can emulate by flipping the high bit to use signed integer pcmpgtd).
For float / double, abs is cheaper and can't overflow: just clear the sign bit, then use that to select the original.
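A minimal sketch for doubles along those lines: fabs() just clears the sign bit, so the magnitude compare can't overflow the way integer abs() can, and compilers typically compile the select to a branchless cmov/blend (though, as noted above, that isn't guaranteed).

#include <math.h>

double closer_to_zero(double a, double b) {
    return fabs(a) < fabs(b) ? a : b;  /* on equal magnitudes, returns b */
}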

How is this faster? 52-bit modulo multiply using "FPU trick" faster than inline ASM on x64

I have found that this:
#define mulmod52(a,b,m) (((a * b) - (((uint64_t)(((double)a * (double)b) / (double)m) - 1ULL) * m)) % m)
... is faster than:
static inline uint64_t _mulmod(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t d, dummy;               /* d will get a*b mod n */
    asm ("mulq %3\n\t"               /* mul a*b -> rdx:rax */
         "divq %4\n\t"               /* (a*b)/n -> quotient in rax, remainder in rdx */
         : "=a"(dummy), "=&d"(d)     /* output */
         : "a"(a), "rm"(b), "rm"(n)  /* input */
         : "cc"                      /* mulq and divq can set condition flags */
        );
    return d;
}
The former is a trick that exploits the FPU to compute the modular multiplication of two numbers of up to 52 bits. The latter is simple x86-64 asm to compute the modular multiplication of two 64-bit numbers, and of course it also works just fine for only 52 bits.
The former is faster than the latter by about 5-15% depending on where I test it.
How is this possible, given that the FPU trick also involves one integer multiply and one integer divide (modulus) plus additional FPU work? There's something I'm not understanding here. Is it some weird compiler artifact, such as the inline asm ruining compiler optimization passes?
On pre-Icelake processors, such as Skylake, there is a big difference between a "full" 128-bit-by-64-bit division and a "half" 64-bit-by-64-bit division (where the upper qword is zero). The full one can take nearly 100 cycles (it varies a bit with the value in rdx, but there is a sudden cliff as soon as rdx is even set to 1), while the half one is more like 30 to 40 cycles depending on the µarch.
The 64-bit floating-point division is (for a division) relatively fast at around 14 to 20 cycles depending on the µarch, so even with that and some other even-less-significant overhead thrown in, it doesn't squander the roughly 60-cycle advantage that the "half" division has over the "full" division. So the complicated floating-point version can come out ahead.
Icelake apparently has an amazing divider that can do a full division in 18 cycles (and a "half" division isn't any faster), so the inline asm should be good on Icelake.
On AMD Ryzen, divisions with a non-zero upper qword seem to get slower more gradually as rdx grows (less of a performance cliff).
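For reference, a function form of the macro (a sketch, under the same below-2^52 input assumption) makes the mechanics easier to see: the double division yields an estimate q of the quotient, subtracting (q - 1) * m keeps the intermediate non-negative even if q was rounded up by one, and a cheap 64-by-64 % finishes the job, so the expensive 128-by-64 hardware division is never issued.

#include <stdint.h>

static inline uint64_t mulmod52_fn(uint64_t a, uint64_t b, uint64_t m) {
    uint64_t q = (uint64_t)((double)a * (double)b / (double)m); /* quotient estimate */
    return (a * b - (q - 1) * m) % m; /* wraps mod 2^64, but the true value fits */
}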

Strange pointer arithmetic

I came across some very strange pointer-arithmetic behaviour. I am developing a program to read an SD card from an LPC2148 using the ARM GNU toolchain (on Linux). A sector of my SD card contains data (in hex) like the following (checked with the Linux xxd command):
fe 2a 01 34 21 45 aa 35 90 75 52 78
When printing individual bytes, everything prints perfectly.
char *ch = buffer; /* char buffer[512]; */
for (i = 0; i < 12; i++)
    debug("%x ", *ch++);
Here the debug function sends its output over the UART.
However, pointer arithmetic, especially adding an offset which is not a multiple of 4, gives very strange results.
uint32_t *p; // uint32_t is typedef to unsigned long.
p = (uint32_t*)((char*)buffer + 0);
debug("%x ", *p); // prints 34012afe // correct
p = (uint32_t*)((char*)buffer + 4);
debug("%x ", *p); // prints 35aa4521 // correct
p = (uint32_t*)((char*)buffer + 2);
debug("%x ", *p); // prints 0134fe2a // TOO STRANGE??
Am I choosing a wrong compiler option? Please help.
I tried the optimization options -O0 and -Os, but there was no change.
I thought of little/big-endian issues, but here I am getting unexpected data (from the preceding bytes) and no byte-order reversal.
For this to work, your CPU architecture must support unaligned load and store operations.
To the best of my knowledge, it doesn't (and I've been using the STM32, which is an ARM Cortex-based MCU).
If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get wrong output.
I'm not sure what the address of your buffer is, but at least one of the three uint32_t reads that you describe in your question requires the processor to perform an unaligned load operation.
On STM32, you would get a memory-access violation (resulting in a hard-fault exception).
The data-sheet should provide a description of your processor's expected behavior.
UPDATE:
Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).
So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:
uint8_t  x = *(uint8_t*)y;  // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8
In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:
struct
{
    uint16_t a; // offset 0, divisible by sizeof(uint16_t), which is 2
    uint8_t  b; // offset 2, divisible by sizeof(uint8_t), which is 1
    uint8_t  c; // offset 3, divisible by sizeof(uint8_t), which is 1
    uint32_t d; // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t e; // offset 8, divisible by sizeof(uint64_t), which is 8
};
Please note, that this does not guarantee that your data-structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using, is pointing to a memory address divisible by the size of the largest field (in the example above, 8).
SUMMARY:
There are two basic rules that you need to follow:
Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.
Exceptions:
Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient than aligned ones. Ideally, one should strive to follow rule #1 even if the compiler does support unaligned operations, and let the compiler know that the data is well aligned (for example, with an alignment attribute or pragma), in order to allow it to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).
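As an illustration of rule #2 with explicit padding, a short sketch (field names hypothetical):

#include <stdint.h>

typedef struct
{
    uint8_t  flags;  // offset 0
    uint8_t  pad[3]; // explicit padding so that 'count' lands at offset 4
    uint32_t count;  // offset 4, divisible by sizeof(uint32_t)
} record_t;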
LDR is the ARM instruction to load data. You have lied to the compiler: the pointer does not actually point to a properly aligned 32-bit value. You pay the price. Here is the LDR documentation:
If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].
See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.
Basically, your third read behaves like this,

uint32_t v = *(uint32_t*)buffer;  /* aligned load of the word at offset 0 */
v = (v >> 16) | (v << 16);        /* rotated by 8 * (address & 3) = 16 bits */
debug("%x ", v);

but encoded as a single instruction on the ARM. This behaviour depends on the ARM CPU type and possibly on co-processor values. It is also highly non-portable code.
It's called "undefined behavior". Your code is casting a pointer which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined behavior, which means pretty much anything can happen.
In this case, the reason two of your examples behaved as you expected is that you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it had been, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which it appears is what you were hoping for), or it could result in an aligned read (using the most significant 30 bits of the address) and a rotation (the word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the latter is what happened in your 3rd example.
Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.
Basically, don't do that.
A safer alternative is to write the bytes into a uint32_t. Something like:
uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);
Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BIT == 8, but that's a much safer assumption. (I.e., it should work on pretty much any machine with 8-bit bytes.)

linear interpolation on 8bit microcontroller

I need to do a linear interpolation over time between two values on an 8 bit PIC microcontroller (Specifically 16F627A but that shouldn't matter) using PIC assembly language. Although I'm looking for an algorithm here as much as actual code.
I need to take an 8 bit starting value, an 8 bit ending value and a position between the two (Currently represented as an 8 bit number 0-255 where 0 means the output should be the starting value and 255 means it should be the final value but that can change if there is a better way to represent this) and calculate the interpolated value.
Now, the PIC doesn't have a divide instruction, so I could code up a general-purpose divide routine and effectively calculate (B-A)/(x/255)+A at each step, but I feel there is probably a much better way to do this on a microcontroller than the way I'd do it on a PC in C++.
Has anyone got any suggestions for implementing this efficiently on this hardware?
The value you are looking for is (A*(255-x)+B*x)/255. It requires only 8x8 multiplication, and a final division by 255, which can be approximated by simply taking the high byte of the sum.
Choosing x in range 0..128, no approximation is needed: take the high byte of (A*(128-x)+B*x)<<1.
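A minimal C sketch of both formulas (function names hypothetical); on the PIC, the 8x8 multiplies would be done with a routine like the one shown further down:

#include <stdint.h>

/* x in 0..255: dividing by 256 (taking the high byte) instead of 255
   makes the result approximate, off by at most one step. */
uint8_t lerp255(uint8_t a, uint8_t b, uint8_t x) {
    uint16_t acc = (uint16_t)a * (255 - x) + (uint16_t)b * x; /* fits in 16 bits */
    return acc >> 8;
}

/* x in 0..128: doubling the sum turns the high-byte trick into an exact /128. */
uint8_t lerp128(uint8_t a, uint8_t b, uint8_t x) {
    uint16_t acc = ((uint16_t)a * (128 - x) + (uint16_t)b * x) << 1;
    return acc >> 8;
}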
Assuming you interpolate a sequence of values where the previous endpoint is the new start point:
(B-A)/(x/255)+A
sounds like a bad idea. If you use base 255 as a fixed-point representation, you evaluate the same interpolant twice: you get B when x=255, and B again as the new A when x=0.
Use 256 as the fixed-point base instead. Divides become shifts, but you need 16-bit arithmetic and an 8x8 multiplication with a 16-bit result. The previous issue fixes itself, since x mod 256 wraps to 0 (any bits in the higher byte are simply ignored). This suggestion uses a 16-bit multiplication but can't overflow, and you don't interpolate over the same x twice.
interp = (a*(256 - x) + b*x) >> 8
256 - x becomes just a subtract-with-borrow, as you get 0 - x.
The PIC lacks these operations in its instruction set:
Right and left shift. (both logical and arithmetic)
Any form of multiplication.
You can get a right shift by using rotate-right instead, followed by masking out the extra bits on the left with a bitwise AND. A straightforward way to do 8x8 multiplication with a 16-bit result:
void mul16(
    unsigned char* hi, /* in: operand1, out: the most significant byte */
    unsigned char* lo  /* in: operand2, out: the least significant byte */
)
{
    unsigned char a, b;
    /* loop over the smaller of the two values */
    a = (*hi <= *lo) ? *hi : *lo;
    b = (*hi <= *lo) ? *lo : *hi;
    *hi = *lo = 0;
    while (a) {
        *lo += b;
        if (*lo < b)  /* unsigned overflow; in asm, test the carry flag instead */
            (*hi)++;  /* note the parentheses: increment the high byte, not the pointer */
        --a;
    }
}
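A hypothetical usage sketch tying mul16 to the interpolation formula above; x == 0 is special-cased because 256 - 0 needs nine bits:

unsigned char lerp(unsigned char a, unsigned char b, unsigned char x)
{
    unsigned char h1 = a, l1 = (unsigned char)(0 - x); /* a * (256 - x) */
    unsigned char h2 = b, l2 = x;                      /* b * x */
    unsigned int sum;                                  /* 16 bits suffice */

    if (x == 0)
        return a;
    mul16(&h1, &l1);
    mul16(&h2, &l2);
    sum = (((unsigned)h1 << 8) | l1) + (((unsigned)h2 << 8) | l2);
    return (unsigned char)(sum >> 8);                  /* the >> 8 from the formula */
}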
The techniques described by Eric Bainville and Mads Elvheim will work fine; each one uses two multiplies per interpolation.
Scott Dattalo and Tony Kubek have put together a super-optimized PIC-specific interpolation technique called "twist" that is slightly faster than two multiplies per interpolation.
Is using this difficult-to-understand technique worth running a little faster?
You could do it using 8.8 fixed-point arithmetic. Then a number from range 0..255 would be interpreted as 0.0 ... 0.996 and you would be able to multiply and normalize it.
Tell me if you need any more details or if it's enough for you to start.
You could characterize this instead as:
(B-A)*(256/(x+1))+A
using a value range of x=0..255, precompute the values of 256/(x+1) as fixed-point numbers in a table, and then code a general-purpose multiply, adjusting for the position of the binary point. This might not be small space-wise; I'd expect you to need a 256-entry table of 16-bit values plus the multiply code. (If you don't need speed, this suggests your division method is fine.) But it only takes one multiply and an add.
My guess is that you don't need every possible value of X. If there are only a few values of X, you can compute them offline, do a case-select on the specific value of X and then implement the multiply in terms of a fixed sequence of shifts and adds for the specific value of X. That's likely to be pretty efficient in code and very fast for a PIC.
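A sketch of that case-select idea (the reachable x values here are purely illustrative): for x in {85, 127, 255}, the rounded constants 256/(x+1) are 3, 2 and 1, so each multiply collapses to at most a shift and an add.

unsigned int mul_by_recip(unsigned char d, unsigned char x)
{
    switch (x) {
    case 85:  return ((unsigned int)d << 1) + d; /* 256/86  rounds to 3 */
    case 127: return (unsigned int)d << 1;       /* 256/128 = 2 */
    case 255: return d;                          /* 256/256 = 1 */
    default:  return 0; /* unreachable if x is one of the known values */
    }
}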
Interpolation
Given two values X and Y, it's basically:
(X+Y)/2
or
X/2 + Y/2 (to prevent the odd case where X+Y might overflow the register)
Hence try the following:
(Pseudo-code)
Initially A=MAX, B=MIN
Loop {
    Right-shift A by 1 bit.
    Right-shift B by 1 bit.
    C = ADD the two results.
    Check the MSB of the 8-bit interpolation value.
    If MSB=0, then B=C.
    If MSB=1, then A=C.
    Left-shift the 8-bit interpolation value.
} Repeat until the 8-bit interpolation value becomes zero.
The actual code is just as easy; I just don't remember the registers and instructions off-hand.
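An illustrative C translation of the pseudo-code (an untested sketch; the two half-shifts drop low bits, so a few LSBs of error can accumulate over the eight passes):

unsigned char lerp_shift(unsigned char a, unsigned char b, unsigned char t)
{
    /* a is the value at t = 0; b is (approximately) the value at t = 255 */
    while (t) {
        unsigned char c = (a >> 1) + (b >> 1); /* midpoint without overflow */
        if (t & 0x80)                          /* MSB of the interpolation value */
            a = c;
        else
            b = c;
        t <<= 1;                               /* consume one bit per pass */
    }
    return a;
}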

Loop versioning with GCC

I am working on auto vectorization with GCC. I am not in a position to use intrinsics or attributes due to customer requirement. (I cannot get user input to support vectorization)
If the alignment information of the array that can be vectorized is unknown, GCC invokes a pass for 'loop versioning'. Loop versioning will be performed when loop vectorization is done on trees. When a loop is identified to be vectorizable, and the constraint on data alignment or data dependence is hindering it, (because they cannot be determined at compile time), then two versions of the loop will be generated. These are the vectorized and non-vectorized versions of the loop along with runtime checks for alignment or dependence to control which version is executed.
My question is: how do we enforce the alignment? If I have found a loop that is vectorizable, I should not have to generate two versions of the loop just because the alignment information is missing.
For example, consider the code below:
short a[15]; short b[15]; short c[15];
int i;

void foo()
{
    for (i = 0; i < 15; i++)
    {
        a[i] = b[i];
    }
}
Tree dump (options: -fdump-tree-optimized -ftree-vectorize)
<SNIP>
vector short int * vect_pa.49;
vector short int * vect_pb.42;
vector short int * vect_pa.35;
vector short int * vect_pb.30;
<bb 2>:
vect_pb.30 = (vector short int *) &b;
vect_pa.35 = (vector short int *) &a;
if (((signed char) vect_pa.35 | (signed char) vect_pb.30) & 3 == 0) ;; <== (A)
    goto <bb 3>;
else
    goto <bb 4>;
<bb 3>:
</SNIP>
At bb 3, the vectorized version of the loop is generated; at bb 4, the non-vectorized version. The choice between them is made by checking the alignment (statement A). Now, without using intrinsics or other attributes, how can I get only the vectorized code (without this runtime alignment check)?
If the data in question is allocated statically, then you can use GCC's aligned variable attribute (__attribute__((aligned(N)))) to specify that it should be aligned to the necessary boundary. If you are dynamically allocating these arrays, you can over-allocate by the alignment value, and then bump the returned pointer up to the alignment you need.
You can also use the posix_memalign() function if you're on a system that supports it. Finally, note that malloc() will always allocate memory aligned to the size of the largest built-in type, generally 8 bytes for a double. If you don't need better than that, then malloc should suffice.
Edit: If you modify your allocation code to force that check to be true (i.e. over-allocate, as suggested above), the compiler should oblige by not conditionalizing the loop code. If you needed alignment to an 8-byte boundary, as it seems, that would be something like a = (a + 7) & ~7;.
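A sketch of both suggestions, assuming the vectorizer wants 16-byte alignment (array and function names are hypothetical):

#include <stdint.h>
#include <stdlib.h>

/* Statically allocated: ask GCC for the alignment directly. */
short aligned_a[15] __attribute__((aligned(16)));
short aligned_b[15] __attribute__((aligned(16)));

/* Dynamically allocated: over-allocate, then round the pointer up.
   (Keep the raw pointer around if you ever need to free() it;
   posix_memalign() is the cleaner option where available.) */
void *alloc_aligned16(size_t size)
{
    unsigned char *raw = malloc(size + 15);
    if (raw == NULL)
        return NULL;
    return (void *)(((uintptr_t)raw + 15) & ~(uintptr_t)15);
}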
I get only one version of the loop, using your exact code with these options: gcc -march=core2 -c -O2 -fdump-tree-optimized -ftree-vectorize vec.c
My version of GCC is gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu8).
GCC is doing something clever here. It forces the arrays a and b to be 16-byte aligned. It doesn't do that to c, presumably because c is never used in a vectorizable loop.
