Floating point compare of absolute values in AVX

Floating point compare of absolute values in AVX - performance

I would like to compare two vectors of doubles based on their absolute values.
That is, the vector equivalent of the following:
if (fabs(x) < fabs(y)) {
...
}
Is there anything better than just taking the absolute value of each side and following up with a _mm256_cmp_pd?
Interested in all of AVX, AVX2, and AVX-512 flavors.

With AVX-512 you can save one µop. Instead of 2xvandpd+vcmppd you can use
vpternlogq+vpcmpuq. Note that the solution below assumes that neither input
is NaN.
IEEE-754 floating point numbers have the nice property that they are encoded
such that if x[62:0] integer_less_than y[62:0], then as a floating point:
abs(x)<abs(y).
So, instead of setting both sign bits to 0, we can copy the sign bit of x
to the sign bit of y and compare the result as an unsigned integer.
In the (untested) code below, for negative x both xi[63] and yi_sgnx[63] are 1,
while for positive x, both xi[63] and yi_sgnx[63] are 0.
So the unsigned integer compare actually compares xi[62:0] with yi[62:0], which is just what we need for the comparison abs(x)<abs(y).
The vpternlog instruction is suitable for copying the sign bit.
I'm not sure if the constants z and 0xCA are chosen correctly.
#include <immintrin.h>

__mmask8 cmplt_via_ternlog(__m512d x, __m512d y){
__m512i xi = _mm512_castpd_si512(x);
__m512i yi = _mm512_castpd_si512(y);
__m512i z = _mm512_set1_epi64(0x7FFFFFFFFFFFFFFFull); /* mask for bits 62:0 */
/* imm8 0xCA is the bitwise select z ? yi : xi, so bit 63 comes from xi */
__m512i yi_sgnx = _mm512_ternarylogic_epi64(z, yi, xi, 0xCA);
return _mm512_cmp_epu64_mask(xi, yi_sgnx, 1); /* _MM_CMPINT_LT */
}
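The sign-copy trick can be checked without AVX-512 hardware using a scalar model. The following is only a sketch: ternlog64, bits_of and abs_less_scalar are hypothetical helper names, and the model assumes that vpternlogq indexes its 8-bit immediate with (a<<2)|(b<<1)|c for each bit position and that double is a 64-bit IEEE-754 type.

```c
#include <stdint.h>
#include <string.h>

/* Bit-by-bit model of a three-input logic op: for every bit position,
   the result bit is imm8[(a<<2)|(b<<1)|c]. */
uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1) << 2) |
                                  (((b >> i) & 1) << 1) |
                                  ((c >> i) & 1));
        r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
    return r;
}

/* Reinterpret a double as its 64-bit pattern (assumes an 8-byte double). */
uint64_t bits_of(double d) {
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Scalar equivalent of one lane: returns 1 if fabs(x) < fabs(y).
   NaN inputs are not handled, as in the vector version. */
int abs_less_scalar(double x, double y) {
    uint64_t xi = bits_of(x);
    uint64_t yi = bits_of(y);
    uint64_t z = 0x7FFFFFFFFFFFFFFFull;
    uint64_t yi_sgnx = ternlog64(z, yi, xi, 0xCA); /* bits 62:0 of y, sign of x */
    return xi < yi_sgnx;                           /* unsigned compare */
}
```

With this model, 0xCA behaves as the select (z & yi) | (~z & xi), which is the sign copy described above.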

Related

Efficiently transform float and double values into bytes that preserve the comparison relationship between values

I need a method by which to efficiently translate any float or double value to an array of bytes so that it preserves the comparison relationship to any other value.
Example: V1 and V2 are turned into arrays A1 and A2. If A1[0]<A2[0], then V1 must be smaller than V2. Same for larger. If A1[0]==A2[0] and A1[1]>A2[1] then V1 must be larger than V2. And so on. If all the bytes are the same, then the values V1 and V2 must be equal.
For a four byte integer I, an array that would satisfy the above condition would be [U>>24, (U>>16)&255, (U>>8)&255, U&255], where U is the uint positive value V-int.MinValue.
Since doubles are stored as 8 bytes, I expect something close to 8 bytes.
Do you think such a thing can be achieved? Thanks!
C# solution is preferred.

The standard representation for doubles and floats that is used by most languages, IEEE 754, is already very close to supporting this requirement.
In C#, you can use BitConverter.DoubleToInt64Bits or SingleToInt32Bits to get the underlying bits of a double or float directly as an integer.
In order to make comparisons work out right, you only have to fix up the way negative numbers are handled:
long bits = BitConverter.DoubleToInt64Bits( theDouble );
if (bits < 0L) {
bits ^= Int64.MaxValue;
}
The resulting longs will then have the same numeric order as the corresponding doubles. This works for all values except NaN, which isn't really comparable to anything else. The infinities, +0.0 and -0.0 work fine.
If you want +0.0 and -0.0 to have the same value, you can do this:
long bits = BitConverter.DoubleToInt64Bits( theDouble );
if (bits < 0L) {
bits = (bits^Int64.MaxValue)+1L;
}
Note that if you want to make your byte array, you'll probably want to convert to an unsigned integer. You need to flip the sign bit if you want to preserve the ordering, or just do it like this:
long bits = BitConverter.DoubleToInt64Bits( theDouble );
ulong arraybits;
if (bits >= 0L) {
arraybits = (1UL<<63) + (ulong)bits;
} else {
arraybits = (ulong)~bits;
}
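For comparison, here is a rough C sketch of the same mapping carried through to the byte array the question asks for. The function names are made up for illustration, and the code assumes an 8-byte IEEE-754 double. Bytes are written most significant first so that memcmp order matches numeric order.

```c
#include <stdint.h>
#include <string.h>

/* Writes 8 bytes whose lexicographic (memcmp) order matches the numeric
   order of the doubles. NaN is not handled. */
void double_to_sortable_bytes(double v, unsigned char out[8]) {
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);      /* assumes sizeof(double) == 8 */
    if (bits >> 63) {
        bits = ~bits;                    /* negative: flip every bit */
    } else {
        bits |= (uint64_t)1 << 63;       /* non-negative: set the top bit */
    }
    for (int i = 0; i < 8; i++) {
        out[i] = (unsigned char)(bits >> (56 - 8 * i)); /* big-endian */
    }
}

/* Convenience wrapper: negative, zero or positive, like memcmp. */
int compare_as_bytes(double a, double b) {
    unsigned char ka[8], kb[8];
    double_to_sortable_bytes(a, ka);
    double_to_sortable_bytes(b, kb);
    return memcmp(ka, kb, 8);
}
```

As in the first C# variant, -0.0 sorts just below +0.0 here.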

Sampling from all possible floats in D

In the D programming language, the standard random (std.random) module provides a simple mechanism for generating a random number in some specified range.
auto a = uniform(0, 1024, gen);
What is the best way in D to sample from all possible floating point values?
For clarification, sampling from all possible 32-bit integers can be done as follows:
auto l = uniform!int(); // randomly selected int from all possible integers

Depends on the kind of distribution you want.
A uniform distribution over all possible values could be done by generating a random ulong and then casting the bits into floating point. For T being float or double:
union both { ulong input; T output; }
both val;
val.input = uniform!"[]"(ulong.min, ulong.max);
return val.output;
Since roughly half of the positive floating point numbers are between 0 and 1, this method will often give you numbers near zero. It will also give you infinity and NaN values.
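Aside: This code should be fine with D, but the equivalent union trick is undefined behavior in C++ (C does allow reading a union member other than the one last written). memcpy is the portable choice in both.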
If you prefer a uniform distribution over all possible numbers in floating point (equal probability for 0..1 and 1..2 etc), you need something like the normal uniform!double, which unfortunately does not work very well for large numbers. It also will not generate infinity or NaN. You could generate double numbers and convert them to float, but I have no answer for generating random large double numbers.
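The memcpy route mentioned in the aside might look like this in C. It is a minimal sketch: double_from_bits is a hypothetical name, the 64 random bits are assumed to come from whatever generator the caller already has, and double is assumed to be 8 bytes.

```c
#include <stdint.h>
#include <string.h>

/* Reinterprets 64 bits as a double without type-punning through a pointer
   or union. Every bit pattern is accepted, so the result may be an
   infinity, a NaN or a subnormal. */
double double_from_bits(uint64_t bits) {
    double d;
    memcpy(&d, &bits, sizeof d);   /* assumes sizeof(double) == 8 */
    return d;
}
```

Feeding it uniformly random 64-bit values gives the "uniform over all bit patterns" distribution described above.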

Should a Float or Int be used in this RNG?

I am using a simple Linear Congruential Generator to generate random numbers. The problem is, the result is behaving inconsistently depending on if I use Floats (known as Numbers in some languages) or Ints
// Variable definitions
var _seed:int = 1;
const MULTIPLIER:int = 48271;
const MODULUS:int = 2147483647; // 0x7FFFFFFF (31 bit integer)
// Inside the function
return _seed = ((_seed * MULTIPLIER) % MODULUS) & MODULUS;
The part I'm having difficulties with is the (_seed * MULTIPLIER) part. If _seed and MULTIPLIER are Ints, the int*int multiplication ensues, and most languages give an int as a result. The problem is, if that int is too large, the resulting value is truncated down.
Is this integer overflow behavior "supposed to be done" in RNGs, or should I cast _seed and MULTIPLIER to Floats before the multiplication in order to allow for larger variables?

LCGs are implemented with integer arithmetic because floating point arithmetic is only approximate: a floating point implementation will diverge from the integer implementation and won't yield a full cycle for the generator. Even a double has only 53 bits of precision (52 stored mantissa bits), which is fewer than the up to 64 bits required to store the product of two 32 bit ints exactly. With modulo arithmetic it's the low bits that are significant, and they're the ones at risk of getting lopped off.
Solutions:
1. Do the intermediate arithmetic using 64 bit integers, then cast/convert the result back to 32 bit ints after the modulo operation.
2. Explicitly break up the multiplication into low bits/high bits components, and then recombine them after the modulo operation. This is what Schrage did to achieve a portable FORTRAN implementation of a relatively popular (at the time) LCG.
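Both solutions can be sketched in C with the constants from the question. The function names are illustrative only, and the seed is assumed to lie in 1..MODULUS-1.

```c
#include <assert.h>
#include <stdint.h>

#define MULTIPLIER 48271
#define MODULUS 2147483647   /* 2^31 - 1 */

/* Solution 1: widen to 64 bits so the product cannot overflow. */
int32_t lcg_next_wide(int32_t seed) {
    assert(seed > 0 && seed < MODULUS);
    return (int32_t)(((int64_t)seed * MULTIPLIER) % MODULUS);
}

/* Solution 2: Schrage's decomposition. With q = m / a and r = m % a,
   a*(seed % q) - r*(seed / q) stays within 32-bit range and is congruent
   to a*seed modulo m. */
int32_t lcg_next_schrage(int32_t seed) {
    assert(seed > 0 && seed < MODULUS);
    const int32_t q = MODULUS / MULTIPLIER;   /* 44488 */
    const int32_t r = MODULUS % MULTIPLIER;   /* 3399 */
    int32_t t = MULTIPLIER * (seed % q) - r * (seed / q);
    return t > 0 ? t : t + MODULUS;
}
```

The two functions should produce the same sequence; the second is only worth the extra complexity where 64-bit integers are unavailable or slow.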

How to compute the integer absolute value

How to compute the integer absolute value without using if condition.
I guess we need to use some bitwise operation.
Can anybody help?

Same as existing answers, but with more explanations:
Let's assume a twos-complement number (as it's the usual case and you don't say otherwise) and let's assume 32-bit:
First, we perform an arithmetic right-shift by 31 bits. This shifts in all 1s for a negative number or all 0s for a positive one. (Note that the behaviour of the actual >> operator on negative numbers is implementation-defined in C and C++. It will usually perform an arithmetic shift, but let's just assume pseudocode or actual hardware instructions, since it sounds like homework anyway.)
mask = x >> 31;
So what we get is 111...111 (-1) for negative numbers and 000...000 (0) for positives
Now we XOR this with x, getting the behaviour of a NOT for mask=111...111 (negative) and a no-op for mask=000...000 (positive):
x = x XOR mask;
And finally subtract our mask, which means +1 for negatives and +0/no-op for positives:
x = x - mask;
So for positives we perform an XOR with 0 and a subtraction of 0 and thus get the same number. And for negatives, we got (NOT x) + 1, which is exactly -x when using twos-complement representation.

1. Set the mask as the right shift of the integer by 31 (assuming integers are stored as two's-complement 32-bit values and that the right-shift operator does sign extension).
mask = n>>31
2. XOR the mask with the number.
mask ^ n
3. Subtract the mask from the result of step 2 and return the result.
(mask^n) - mask

Assume int is of 32-bit.
int my_abs(int x)
{
int y = (x >> 31);
return (x ^ y) - y;
}

One can also perform the above operation as:
return n*(((n>0)<<1)-1);
where n is the number whose absolute value needs to be calculated.

In C, you can use unions to perform bit manipulations on doubles. The following clears the sign bit of a double. The same idea works for floats with a 32-bit mask, but not for two's-complement integers, where the sign is not a separate bit field.
#include <stdint.h>

/**
* Calculates the absolute value of a double.
* @param x An 8-byte floating-point double
* @return A non-negative double
* @note Uses bit manipulation and does not care about NaNs
*/
double abs_double(double x) /* not named abs, which would clash with stdlib.h */
{
union{
uint64_t bits;
double dub;
} b;
b.dub = x;
//Sets the sign bit to 0
b.bits &= 0x7FFFFFFFFFFFFFFF;
return b.dub;
}
Note that this assumes that doubles are 8 bytes.

I wrote my own, before discovering this question.
My answer is probably slower, but still valid:
int abs_of_x = ((x*(x >> 31)) | ((~x + 1) * ((~x + 1) >> 31)));

If you are not allowed to use the minus sign you could do something like this:
int absVal(int x) {
return ((x >> 31) + x) ^ (x >> 31);
}

For assembly the most efficient would be to initialize a value to 0, subtract the integer, and then take the max:
pxor mm1, mm1 ; set mm1 to all zeros
psubw mm1, mm0 ; make each mm1 word contain the negative of each mm0 word
pmaxsw mm1, mm0 ; mm1 will contain only the positive (larger) values - the absolute value

In C#, you can implement abs() without using any local variables:
public static long abs(long d) => (d + (d >>= 63)) ^ d;
public static int abs(int d) => (d + (d >>= 31)) ^ d;
Note: regarding 0x80000000 (int.MinValue) and 0x8000000000000000 (long.MinValue):
As with all of the other bitwise/non-branching methods shown on this page, this gives the single non-mathematical result abs(int.MinValue) == int.MinValue (likewise for long.MinValue). These represent the only cases where result value is negative, that is, where the MSB of the two's-complement result is 1 -- and are also the only cases where the input value is returned unchanged. I don't believe this important point was mentioned elsewhere on this page.
The code shown above depends on the value of d used on the right side of the xor being the value of d updated during the computation of the left side. To C# programmers this will seem obvious, because C# guarantees that operands are evaluated strictly left to right: the original d is read first, then d >>= 63 stores and yields the sign mask, and the final ^ d reads that mask. The reason I mention this is that in C or C++ one needs to be more cautious. There, the same expression modifies d and reads it with no sequencing between the two, which is undefined behavior, so a port should compute the mask in a separate statement.

If you don't want to rely on implementation of sign extension while right bit shifting, you can modify the way you calculate the mask:
mask = ~((n >> 31) & 1) + 1
then proceed as was already demonstrated in the previous answers:
(n ^ mask) - mask
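A portable C rendering of this variant could do all the work in unsigned arithmetic, which sidesteps both the implementation-defined right shift and signed overflow. This is only a sketch: abs_u32 is a hypothetical name, and returning an unsigned result is a design choice made here so that the most negative input has a representable answer.

```c
#include <stdint.h>

/* Branch-free absolute value computed entirely on unsigned values. */
uint32_t abs_u32(int32_t n) {
    uint32_t u = (uint32_t)n;          /* two's-complement bit pattern */
    uint32_t mask = ~(u >> 31) + 1u;   /* 0 if n >= 0, 0xFFFFFFFF if n < 0 */
    return (u ^ mask) - mask;          /* negative: (~u) + 1; otherwise u */
}
```

Because the result is unsigned, abs_u32(INT32_MIN) yields 2147483648 rather than wrapping back to a negative value.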

What is the programming language you're using? In C# you can use the Math.Abs method:
int value1 = -1000;
int value2 = 20;
int abs1 = Math.Abs(value1);
int abs2 = Math.Abs(value2);

linear interpolation on 8bit microcontroller

I need to do a linear interpolation over time between two values on an 8 bit PIC microcontroller (Specifically 16F627A but that shouldn't matter) using PIC assembly language. Although I'm looking for an algorithm here as much as actual code.
I need to take an 8 bit starting value, an 8 bit ending value and a position between the two (Currently represented as an 8 bit number 0-255 where 0 means the output should be the starting value and 255 means it should be the final value but that can change if there is a better way to represent this) and calculate the interpolated value.
Now PIC doesn't have a divide instruction so I could code up a general purpose divide routine and effectively calculate (B-A)/(x/255)+A at each step, but I feel there is probably a much better way to do this on a microcontroller than the way I'd do it on a PC in C++.
Has anyone got any suggestions for implementing this efficiently on this hardware?

The value you are looking for is (A*(255-x)+B*x)/255. It requires only 8x8 multiplication, and a final division by 255, which can be approximated by simply taking the high byte of the sum.
Choosing x in range 0..128, no approximation is needed: take the high byte of (A*(128-x)+B*x)<<1.
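The 0..128 variant can be modelled in C before porting it to PIC assembly. This is a sketch under the stated range assumption, and lerp128 is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Interpolates between a (x = 0) and b (x = 128).
   The weighted sum is at most 255 * 128 = 32640, so doubling it still
   fits in 16 bits and the high byte is the exact quotient by 128. */
uint8_t lerp128(uint8_t a, uint8_t b, uint8_t x) {
    assert(x <= 128);
    uint16_t sum = (uint16_t)(a * (128 - x) + b * x);
    return (uint8_t)((uint16_t)(sum << 1) >> 8);   /* take the high byte */
}
```

Both endpoints are reproduced exactly: x = 0 returns a and x = 128 returns b.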

Assuming you interpolate a sequence of values where the previous endpoint is the new start point:
(B-A)/(x/255)+A
sounds like a bad idea. If you use base 255 as a fixedpoint representation, you get the same interpolant twice. You get B when x=255 and B as the new A when x=0.
Use 256 as the fixed-point base instead. Divides become shifts, but you need 16-bit arithmetic and 8x8 multiplication with a 16-bit result. The previous issue can be fixed by simply ignoring any bits in the higher bytes, as x mod 256 becomes 0. This suggestion uses 16-bit arithmetic but can't overflow, and you don't interpolate over the same x twice.
interp = (a*(256 - x) + b*x) >> 8
256 - x becomes just a subtract-with-borrow, as you get 0 - x.
The PIC lacks these operations in its instruction set:
Right and left shift. (both logical and arithmetic)
Any form of multiplication.
You can get right-shifting by using rotate-right instead, followed by masking out the extra bits on the left with bitwise-and. A straight-forward way to do 8x8 multiplication with 16-bit result:
void mul16(
unsigned char* hi, /* in: operand1, out: the most significant byte */
unsigned char* lo /* in: operand2, out: the least significant byte */
)
{
unsigned char a,b;
/* loop over the smallest value */
a = (*hi <= *lo) ? *hi : *lo;
b = (*hi <= *lo) ? *lo : *hi;
*hi = *lo = 0;
while(a){
*lo+=b;
if(*lo < b) /* unsigned overflow. Use the carry flag instead.*/
(*hi)++; /* increment the high byte, not the pointer */
--a;
}
}
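For reference, the base-256 formula above can be written as a plain C model that uses the compiler's multiply instead of the addition loop. The name lerp256 is hypothetical; on the PIC the two products would come from a routine like mul16.

```c
#include <stdint.h>

/* interp = (a*(256 - x) + b*x) >> 8, with x in 0..255.
   The sum is at most 255 * 256 = 65280, so it fits in 16 bits. */
uint8_t lerp256(uint8_t a, uint8_t b, uint8_t x) {
    uint16_t sum = (uint16_t)(a * (256 - x) + b * x);
    return (uint8_t)(sum >> 8);   /* divide by 256: keep the high byte */
}
```

Note that x = 255 stops just short of b; b itself is produced by x = 0 of the next segment, which is how this scheme avoids emitting the same interpolant twice.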

The techniques described by Eric Bainville and Mads Elvheim will work fine; each one uses two multiplies per interpolation.
Scott Dattalo and Tony Kubek have put together a super-optimized PIC-specific interpolation technique called "twist" that is slightly faster than two multiplies per interpolation.
Is using this difficult-to-understand technique worth running a little faster?

You could do it using 8.8 fixed-point arithmetic. Then a number from range 0..255 would be interpreted as 0.0 ... 0.996 and you would be able to multiply and normalize it.
Tell me if you need any more details or if it's enough for you to start.

You could characterize this instead as:
(B-A)*(256/(x+1))+A
using a value range of x=0..255, precompute the values of 256/(x+1) as a fixed-point number in a table, and then code a general purpose multiply, adjusting for the position of the binary point. This might not be small spacewise; I'd expect you to need a 256 entry table of 16 bit values and the multiply code. (If you don't need speed, this would suggest your division method is fine.) But it only takes one multiply and an add.
My guess is that you don't need every possible value of X. If there are only a few values of X, you can compute them offline, do a case-select on the specific value of X and then implement the multiply in terms of a fixed sequence of shifts and adds for the specific value of X. That's likely to be pretty efficient in code and very fast for a PIC.

Interpolation
Given two values X & Y, it's basically:
(X+Y)/2
or
X/2 + Y/2 (to prevent the odd case that X+Y might overflow the size of the register)
Hence try the following:
(Pseudo-code)
Initially A=MAX, B=MIN
Loop {
Right-Shift A by 1-bit.
Right-Shift B by 1-bit.
C = ADD the two results.
Check MSB of 8-bit interpolation value
if MSB=0, then B=C
if MSB=1, then A=C
Left-Shift 8-bit interpolation value
}Repeat until 8-bit interpolation value becomes zero.
The actual code is just as easy. Only I do not remember the registers and instructions off-hand.
