I have managed to convert most of my SIMD code to use the vector extensions of GCC. However, I have not found a good solution for doing a broadcast as follows
__m256 areg0 = _mm256_broadcast_ss(&a[i]);
I want to do
__m256 areg0 = a[i];
As you can see in my answer to Multiplying vector by constant using SSE, I managed to get broadcasts working with another SIMD register. The following works:
__m256 x,y;
y = x + 3.14159f; // broadcast x + 3.14159
y = 3.14159f*x; // broadcast 3.14159*x
but this won't work:
__m256 x;
x = 3.14159f; //should broadcast 3.14159 but does not work
How can I do this with GCC?
I think there is currently no direct way and you have to work around it using the syntax you already noticed:
__m256 zero={};
__m256 x=zero+3.14159f;
It may change in the future if we can agree on a good syntax, see PR 55726.
Note that if you want to create a vector { s, s, ... s } with a non-constant float s, the technique above only works with integers, or with floats and -fno-signed-zeros. You can tweak it to __m256 x=s-zero; and it will work unless you use -frounding-math. A last version, suggested by Z boson, is __m256 x=(zero+1.f)*s; which should work in most cases (except possibly with a compiler paranoid about sNaN).
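Putting that last variant together, a minimal sketch of a helper that broadcasts a run-time float using only the vector-extension syntax (the function name is just for illustration):
#include <immintrin.h>

static inline __m256 broadcast8f(float s)
{
    __m256 zero = {};          /* all lanes 0.0f */
    return (zero + 1.0f) * s;  /* {1,...,1} * s  ->  {s,...,s} */
}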
It turns out that with a precise floating-point model (the default, even at -O3) GCC cannot simplify x+0 to x due to signed zero, so x = zero + 3.14159f produces inefficient code. However, GCC can simplify 1.0*x to just x, therefore the efficient solution in this case is:
__m256 x = ((__m256){} + 1)*3.14159f;
https://godbolt.org/g/5QAQkC
See this answer for more details.
A simpler solution is just x = 3.14159f - (__m256){} because x - 0 = x irrespective of signed zero.
I need to implement a digit recurrence square root for a generic floating-point format such that exp_size + mant_size + 1 <= 64.
I basically followed the implementation suggested here: the Handbook of Floating-Point Arithmetic, in the part on the software implementation of floating-point operators.
I've tried to test my implementation (not an exhaustive test), and for a format like 32 bits it seems to work fine, while for a format like mantissa = 10, exponent = 5, for the input x = 0.25, instead of giving me 0.5 it apparently gives me 0.707031.
So I was wondering whether for small formats the digit recurrence approach has some limits, or whether my implementation is simply bad...
I hope you can help me... it's a pain to implement this stuff from scratch...
It is extremely hard to follow your code, but you should:
test all the operand combinations
working for a single example does not mean it works for all of them
check the bit masks
you wrote that with 32 bits the result is fine, but with 10 it is not
that hints at an overflow somewhere
are you sure you have the right bit counts reserved/masked for R?
R should be 2 bits more than Q (+1 bit for accuracy and +1 bit for sign)
and you should also handle R as two's complement
Q is half of the D bits and unsigned
I could not find your algorithm (the book you linked does not let me go further than page 265, where SQRT starts; maybe some incompatibility, I use good old Opera), but the closest one I found on Google (non-restoring SQRT) was in a PDF about research and HW implementation on FPGA, and after clearing out the bugs and testing, this is what I coded in C++ and tested:
#include <cstdint>
typedef uint32_t DWORD; // 32-bit unsigned type

DWORD u32_sqrt(DWORD D) // 32 bit
{
    const int _bits = 32;
    const DWORD _R_mask = (4 << (_bits >> 1)) - 1;
    const DWORD _R_sign =  2 << (_bits >> 1);
    DWORD Q = 0; // u(_bits/2    ) result (quotient)
    DWORD R = 0; // i(_bits/2 + 2) two's complement (remainder) R = D - Q*Q
    for (int i = _bits - 2; i >= 0; i -= 2)
    {
        if (DWORD(R & _R_sign)) { R = (R << 2) | ((D >> i) & 3); R += (Q << 2) | 3; } // R <  0
        else                    { R = (R << 2) | ((D >> i) & 3); R -= (Q << 2) | 1; } // R >= 0
        R &= _R_mask; Q <<= 1; if (!DWORD(R & _R_sign)) Q |= 1;                       // R >= 0
    }
    return Q;
}
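A quick sanity check of the routine above, using a few spot values (this assumes the DWORD typedef shown; the expected outputs are the floor of the true square roots):
#include <cstdio>

int main()
{
    printf("%u\n", (unsigned)u32_sqrt(0));          // expect 0
    printf("%u\n", (unsigned)u32_sqrt(1024));       // expect 32
    printf("%u\n", (unsigned)u32_sqrt(1000000));    // expect 1000
    printf("%u\n", (unsigned)u32_sqrt(0xFFFFFFFF)); // expect 65535
    return 0;
}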
I'm using SCIPAMPL to solve mixed integer nonlinear programming problems (MINLPs). For the most part it's been working well, but I found an instance where the solver detects infeasibility erroneously.
set K default {};
var x integer >= 0;
var y integer >= 0;
var z;
var v1{K} binary;
param yk{K} integer default 0;
param M := 300;
param eps := 0.5;
minimize upperobjf:
16*x^2 + 9*y^2;
subject to
ll1: 4*x + y <= 50;
ul1: -4*x + y <= 0;
vf1{k in K}: z + eps <= (x + yk[k] - 20)^4 + M*(1 - v1[k]);
vf2: z >= (x + y - 20)^4;
aux1{k in K}: -(4*x + yk[k] - 50) <= M*v1[k] - eps;
# fix1: x = 4;
# fix2: y = 12;
let K := {1,2,3,4,5,6,7,8,9,10,11};
for {k in K} let yk[k] := k - 1;
solve;
display x,y,z,v1;
The solver is detecting infeasibility at the presolve phase. However, if you uncomment the two constraints that fix x and y to 4 and 12, the solver works and outputs the correct v and z values.
I'm curious about why this might be happening and whether I can formulate the problem in a different way to avoid it. One suggestion I got was that infeasibility detection is usually not very good with non-convex problems.
Edit: I should mention that this isn't just a SCIP issue. SCIP just hits the issue with this particular set K. If, for instance, I use Bonmin, another global MINLP solver, I can solve the problem for this particular K, but if you expand K to go up to 15, then Bonmin detects infeasibility even though the problem remains feasible. For that K, I have yet to find a solver that actually works. I've also tried MINLP solvers based on FILTER. I have yet to try BARON, since it only takes GAMS input.
There are very good remarks about modeling issues regarding, e.g., big-M constraints in the comments to your original question. Numerical issues can indeed cause troubles, especially when nonlinear constraints are present.
Depending on how deep you would like to dive into that matter, I see 3 options for you:
You can decrease the numerical precision by tuning the parameters numerics/feastol, numerics/epsilon, and numerics/lpfeastol. Put the following lines into a file "scip.set" and save it in the working directory from which you call scipampl:
# absolute values smaller than this are considered zero
# [type: real, range: [1e-20,0.001], default: 1e-09]
numerics/epsilon = 1e-07
# absolute values of sums smaller than this are considered zero
# [type: real, range: [1e-17,0.001], default: 1e-06]
numerics/sumepsilon = 1e-05
# feasibility tolerance for constraints
# [type: real, range: [1e-17,0.001], default: 1e-06]
numerics/feastol = 1e-05
# primal feasibility tolerance of LP solver
# [type: real, range: [1e-17,0.001], default: 1e-06]
numerics/lpfeastol = 1e-05
You can now test different numerical precisions within scipampl by modifying the file scip.set
Save the solution you obtain by fixing your x- and y-variables. If you pass this solution to the model without the fixings, you get a message about what caused the infeasibility. Usually, you will get a message that some variable bound or constraint is violated slightly outside a tolerance.
If you want to know precisely through which presolver a solution becomes infeasible, or if the former approach does not show any violation, SCIP offers the functionality to read in a debug solution; Specify the solution file "debug.sol" by uncommenting the line in src/scip/debug.h
/* #define SCIP_DEBUG_SOLUTION "debug.sol" */
and recompile SCIP and SCIPAmpl by using
make DBG=true
SCIP checks the debug-solution against every presolving reduction and outputs the presolver which causes the trouble.
I hope this is useful for you.
Looking deeper into this instance, SCIP seems to do something wrong in presolve.
In cons_nonlinear.c:7816 (function consPresolNonlinear), remove the line
if( nrounds == 0 )
so that SCIPexprgraphPropagateVarBounds is executed in any case.
That seems to fix the issue.
How do you compute the integer absolute value without using an if condition?
I guess we need to use some bitwise operations.
Can anybody help?
Same as existing answers, but with more explanations:
Let's assume a two's-complement number (as that's the usual case and you don't say otherwise), and let's assume 32-bit:
First, we perform an arithmetic right shift by 31 bits. This shifts in all 1s for a negative number and all 0s for a positive one (note that the actual >> operator's behaviour in C or C++ is implementation-defined for negative numbers, though it will usually also perform an arithmetic shift; let's just assume pseudocode or actual hardware instructions, since it sounds like homework anyway):
mask = x >> 31;
So what we get is 111...111 (-1) for negative numbers and 000...000 (0) for positives
Now we XOR this with x, getting the behaviour of a NOT for mask=111...111 (negative) and a no-op for mask=000...000 (positive):
x = x XOR mask;
And finally subtract our mask, which means +1 for negatives and +0/no-op for positives:
x = x - mask;
So for positives we perform an XOR with 0 and a subtraction of 0 and thus get the same number. And for negatives, we got (NOT x) + 1, which is exactly -x when using twos-complement representation.
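Putting the three steps together, a minimal sketch (the function name is just for illustration, assuming a 32-bit two's-complement int and an arithmetic right shift):
#include <stdint.h>

int32_t abs_no_branch(int32_t x)
{
    int32_t mask = x >> 31;   /* all 1s if x is negative, all 0s otherwise */
    return (x ^ mask) - mask; /* negative: (NOT x) + 1; positive: unchanged */
}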
1. Set the mask as the right shift of the integer by 31 (assuming integers are stored as two's-complement 32-bit values and that the right-shift operator does sign extension):
mask = n >> 31
2. XOR the mask with the number:
mask ^ n
3. Subtract the mask from the result of step 2 and return the result:
(mask ^ n) - mask
Assume int is 32-bit.
int my_abs(int x)
{
int y = (x >> 31);
return (x ^ y) - y;
}
One can also perform the above operation as:
return n*(((n>0)<<1)-1);
where n is the number whose absolute value needs to be calculated.
In C, you can use a union to perform bit manipulations on doubles. The following will work in C, and the same technique can be used for integers, floats, and doubles.
#include <stdint.h>

/**
 * Calculates the absolute value of a double.
 * @param x An 8-byte floating-point double
 * @return A positive double
 * @note Uses bit manipulation and does not care about NaNs
 */
double abs(double x)
{
    union {
        uint64_t bits;
        double dub;
    } b;
    b.dub = x;
    // Set the sign bit to 0
    b.bits &= 0x7FFFFFFFFFFFFFFF;
    return b.dub;
}
Note that this assumes that doubles are 8 bytes.
I wrote my own, before discovering this question.
My answer is probably slower, but still valid:
int abs_of_x = ((x*(x >> 31)) | ((~x + 1) * ((~x + 1) >> 31)));
If you are not allowed to use the minus sign you could do something like this:
int absVal(int x) {
return ((x >> 31) + x) ^ (x >> 31);
}
For assembly the most efficient approach would be to initialize a value to 0, subtract the integer, and then take the max:
pxor   mm1, mm1 ; set mm1 to all zeros
psubw  mm1, mm0 ; make each mm1 word contain the negative of each mm0 word
pmaxsw mm1, mm0 ; mm1 will contain only the positive (larger) values - the absolute value
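The same zero/subtract/max idea can be written with SSE2 intrinsics; a small sketch (the function name is made up), working on eight 16-bit lanes at once:
#include <emmintrin.h>

__m128i abs_epi16_sketch(__m128i v)
{
    __m128i zero = _mm_setzero_si128();
    __m128i neg  = _mm_sub_epi16(zero, v); /* 0 - v in each 16-bit lane */
    return _mm_max_epi16(v, neg);          /* max(v, -v) == |v|, except for -32768 */
}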
In C#, you can implement abs() without using any local variables:
public static long abs(long d) => (d + (d >>= 63)) ^ d;
public static int abs(int d) => (d + (d >>= 31)) ^ d;
Note: regarding 0x80000000 (int.MinValue) and 0x8000000000000000 (long.MinValue):
As with all of the other bitwise/non-branching methods shown on this page, this gives the single non-mathematical result abs(int.MinValue) == int.MinValue (likewise for long.MinValue). These represent the only cases where the result value is negative, that is, where the MSB of the two's-complement result is 1 -- and are also the only negative inputs that are returned unchanged. I don't believe this important point was mentioned elsewhere on this page.
The code shown above depends on the value of d used on the right side of the XOR being the value of d as updated during the computation of the left side. To C# programmers this will seem obvious: they are used to code like this because C# strictly guarantees left-to-right evaluation of the operands here. The reason I mention this is that in C or C++ one may need to be more cautious; their rules are considerably more permissive, and modifying and reading d within the same expression without an intervening sequence point can make the result unspecified or undefined. In such a regime, this order sensitivity would represent a correctness hazard.
If you don't want to rely on the implementation sign-extending negative values during the right shift, you can modify the way you calculate the mask:
mask = ~((n >> 31) & 1) + 1
then proceed as was already demonstrated in the previous answers:
(n ^ mask) - mask
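As a small sketch (the function name is just for illustration, assuming a 32-bit int):
#include <stdint.h>

int32_t abs_no_sign_extension(int32_t n)
{
    /* (n >> 31) & 1 is 1 for negative n whether the shift sign-extends or not */
    int32_t mask = ~((n >> 31) & 1) + 1; /* -1 (all 1s) if negative, 0 otherwise */
    return (n ^ mask) - mask;
}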
What is the programming language you're using? In C# you can use the Math.Abs method:
int value1 = -1000;
int value2 = 20;
int abs1 = Math.Abs(value1);
int abs2 = Math.Abs(value2);
I'm playing around with the Accelerate framework for the first time with the goal of implementing some vectorized code into an iOS application. I've never tried to do anything with respect to working with vectors in Objective C or C. Having some experience with MATLAB, I wonder if using Accelerate is indeed that much more of a pain. Suppose I'd want to calculate the following:
b = 4*(sin(a/2))^2 where a and b are vectors.
MATLAB code:
a = 1:4;
b = 4*(sin(a/2)).^2;
However, as I see it after some sifting through the documentation, things are quite different using Accelerate.
My C implementation:
float a[4] = {1,2,3,4}; //define a
int len = 4;
float div = 2; //define 2
float a2[len]; //define intermediate result 1
vDSP_vsdiv(a, 1, &div, a2, 1, len); //divide
float sinResult[len]; //define intermediate result 2
vvsinf(sinResult, a2, &len); //take sine
float sqResult[len]; //square the result
vDSP_vsq(sinResult, 1, sqResult, 1, len); //take square
float factor = 4; //multiply all this by four
float b[len]; //define answer vector
vDSP_vsmul(sqResult, 1, &factor, b, 1, len); //multiply
//unset all variables I didn't actually need
Honestly, I don't know what's worst here: keeping track of all intermediate steps, trying to memorize how the arguments are passed in vDSP with respect to VecLib (quite different), or that it takes so much time doing something quite trivial.
I really hope I am missing something here and that most steps can be merged or shortened. Any recommendations on coding resources, good coding habits (learned the hard way or from a book), etc. would be very welcome! How do you all deal with multiple lines of vector calculations?
I guess you could write it that way, but it seems awfully complicated to me. I like this better (intel-specific, but can easily be abstracted for other architectures):
#include <Accelerate/Accelerate.h>
#include <immintrin.h>
const __m128 a = {1,2,3,4};
const __m128 sina2 = vsinf(a*_mm_set1_ps(0.5));
const __m128 b = _mm_set1_ps(4)*sina2*sina2;
Also, just to be pedantic, what you're doing here is not linear algebra. Linear algebra involves only linear operations (no squaring, no transcendental operations like sin).
Edit: as you noted, the above won't quite work out of the box on iOS; the biggest issue is that there is no vsinf (vMathLib is not available in Accelerate on iOS). I don't have the SDK installed on my machine to test, but I believe that something like the following should work:
#include <Accelerate/Accelerate.h>
const vFloat a = {1, 2, 3, 4};
const vFloat a2 = a*(vFloat){0.5,0.5,0.5,0.5};
const int n = 4;
vFloat sina2;
vvsinf((float *)&sina2, (const float *)&a2, &n);
const vFloat b = sina2*sina2*(vFloat){4,4,4,4};
Not quite as pretty as what is possible with vMathLib, but still fairly compact.
In general, a lot of basic arithmetic operations on vectors just work; there's no need to use calls to any library, which is why Accelerate doesn't go out of its way to supply those operations cleanly. Instead, Accelerate usually tries to provide operations that aren't immediately available by other means.
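For instance, a small sketch of the kind of arithmetic that needs no library call at all (the function name is made up):
#include <Accelerate/Accelerate.h>

/* Per-lane 2*x + 1; the compiler lowers this to plain SIMD instructions. */
static vFloat scale_and_offset(vFloat x)
{
    return x * (vFloat){2, 2, 2, 2} + (vFloat){1, 1, 1, 1};
}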
To answer my own question:
In iOS 6, vMathLib will be introduced. As Stephen clarified, vMathLib could already be used on OSX, but it was not available in iOS. Until now.
The functions that vMathLib provides will allow for easier vector calculations.
Is there any possible way to make pseudorandom numbers without any binary operators? Being that this is a 3D map, I'm trying to make it a function of X and Y, but hopefully include a random seed somewhere in there so it won't be the same every time. I know you can make a noise function like this with binary operators:
double PerlinNoise::Noise(int x, int y) const
{
int n = x + y * 57;
n = (n << 13) ^ n;
int t = (n * (n * n * 15731 + 789221) + 1376312589) & 0x7fffffff;
return 1.0 - double(t) * 0.931322574615478515625e-9;/// 1073741824.0);
}
But since I'm using Lua instead of C++, I can't use any binary operators. I've tried many different things, yet none of them work. Help?
For bit operators (I guess that is what you mean by "binary"), have a look at Bitwise Operators Wiki page, which contains a list of modules you can use, like Lua BitOp and bitlib.
If you do not want to implement it by yourself, have a look at the module lua-noise, which contains an implementation of Perlin noise. Note that it is a work-in-progress C module.
If I'm not mistaken, Matt Zucker's FAQ on Perlin noise only uses arithmetic operators to describe/implement it. It only mentions bitwise operators as an optimization trick.
You should implement both ways and test them with the same language/runtime, to get an idea of the speed difference.
In the above routine, the bit-wise operators convert readily to arithmetic operations:
The << 13 becomes * 8192.
The & 0x7FFFFFFF becomes a mod of 2^31.
(The ^ XOR is the one without a simple arithmetic equivalent; for that you would need one of the bit libraries mentioned above or a digit-by-digit emulation.)
As long as overflow isn't an issue, this should be all you need.
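A small sketch of the two conversions, written in C with doubles since Lua numbers are doubles (the function names are made up; both assume non-negative integer-valued n):
#include <math.h>

/* n << 13, using only arithmetic */
double shift_left_13(double n)   { return n * 8192.0; }

/* n & 0x7FFFFFFF, i.e. n mod 2^31, using only arithmetic */
double mask_to_31_bits(double n) { return fmod(n, 2147483648.0); }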
It'd be pretty slow, but you could simulate these with division and multiplication, I believe.