Halide: How to deal with Expr evaluated as nan or inf?

I have a 1D Func over which I'd like to perform the following: take the sum of a kernel of n values, and divide it by the sum of the kernel shifted by 1. Here's the code I have so far:
Var x("x");
Func result("result");
RDom r(0, kernel_size);
Expr sum1 = sum(vec_func(x+r));
Expr sum2 = sum(vec_func(x+r+1));
Expr quotient = sum1 / sum2;
result(x) = quotient;
This is an example of the type of calculation which might result in a NaN or Inf. Ideally I would be able to deal with this in Halide using something like this:
Expr safe_calc = select(isnan(quotient) || isinf(quotient), 0, quotient);
result(x) = safe_calc;
Does such a method exist in Halide?

Expr Halide::is_nan(Expr) exists right now, but we are missing is_finite. (Added as https://github.com/halide/Halide/issues/2497)
However: be aware that Halide does floating point math in accordance with -ffast-math rules, which means it is allowed to optimize the code in ways that assume NaN/Inf values can't happen. If it's possible to structure your code in a way to ensure such values aren't possible, you should do so.


Is there any way to compute 1.01^x without loops, using only integer add, mul, sub, div, exp?

Is there any way to implement the Uint256 -> Uint256 function f(x) = floor(1.01 ^ x) using only a constant number of add, mul, sub, div, and exp operations, all of which can operate only on integers?
Use Newton's binomial series
(1+h)^x = 1+x*h + x*(x-1)/2*h^2 + x*(x-1)*(x-2)/6*h^3 + ...
= 1 + x*h*(1+(x-1)*h/2*(1+(x-2)*h/3*(1+...)))
To get a terminating computation, one would first have to reduce x, I'd think by multiples of log(2)/log(1.01).
Essentially, the intermediate results need some kind of fixed-point arithmetic.
I would try fixed point. Let's assume for illustration that your integer is an unsigned 8-bit int; then try an 8.8, 8.16, or 8.32 fixed-point format, depending on the precision you need.
Let's write the number in a.b format and assume 8.16 (8.8 drops too many of the binary ones of 1.01 for my taste), so you get approximately:
1.01^x = (1+0.01*65536/65536)^x = (1+655/65536)^x
Now just compute the integer pow, for example with power by squaring, see:
Power by squaring for negative exponents
Then convert the result back to an integer, using either:
floor (use just a)
round (increment a if b>32767)
To avoid loops, just unroll the power by squaring into a few copy-pasted lines (one for each bit of the exponent).
btw this can be done with +,<< if you realize you are multiplying by:
1.01dec = 1.0000001010001111 bin
So you just can do shift and add for each binary 1 instead of mul in case mul is a problem ...
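A minimal sketch of the fixed-point idea in Python, under stated assumptions: it uses 32 fractional bits rather than the 8.16 format above, relies on Python's unbounded integers for the intermediate products, and the names FRAC_BITS, fixed_mul and floor_pow_1_01 are made up for illustration. The loop runs a fixed number of times (one pass per exponent bit), so it could be unrolled into straight-line code. Because every step truncates, the result can be off by one when 1.01^x lies extremely close to an integer.

```python
FRAC_BITS = 32            # assumed number of fractional bits
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def fixed_mul(a, b):
    """Multiply two fixed-point numbers, truncating the extra fractional bits."""
    return (a * b) >> FRAC_BITS


def floor_pow_1_01(x, exp_bits=16):
    """Approximate floor(1.01 ** x) for 0 <= x < 2 ** exp_bits using integers only."""
    base = ONE + ONE // 100          # 1.01 in fixed point (truncated)
    result = ONE
    for i in range(exp_bits):        # fixed trip count: one step per exponent bit
        if (x >> i) & 1:
            result = fixed_mul(result, base)
        base = fixed_mul(base, base)  # base now holds 1.01 ** (2 ** (i + 1))
    return result >> FRAC_BITS       # drop the fractional part (floor)
```

A wider fractional part costs larger intermediate products but reduces the accumulated truncation error for large x.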

Generate random floats from random bytes without bit-twiddling

Assuming I have a good-enough(tm) stream of random byte values, is there a mathematical way to convert these into (0 < n < 1) floating-point values that does not need to know the internal format of the floats?
I'm looking for something that:
Doesn't require bitwise operations (on the floats), and
Is an iterative process that we can know will give a good value after n iterations, where n is a function of the output precision.
A general process that can be used for floats of any precision, by simply changing the number of iterations, ie consuming more input bytes to generate a double than a single-precision float.
The naive solution is to just build yourself a big integer from a few bytes, convert it to float, and divide by 2^n, but I can't see how to do that without messing up the distribution.
Another idea is something like this (pseudocode):
state := 0.0
n := requiredIterations(outputPrecision)
repeat n times:
    nextByte := getRandomByte()
    state := state + nextByte
    state := state / 256
return state
It seems like this should work, but I don't know how to prove it :)
OK, I think I've got what you need.
Let's consider sampling a float in the range [0...1) in the following way. 256 is 2^8, so multiplying by 256 is equivalent to shifting by one byte. Let's combine the bytes as
b0*256*256*256 + b1*256*256 + b2*256 + b3
To get number in [0...1) range you have to divide it by 256*256*256*256, thus
f = b0/256 + b1/(256*256) + b2/(256*256*256) + b3/(256*256*256*256)
which, in turn, is equivalent to Horner scheme of polynomials computation
f = (1/256)*(b0 + (1/256)*(b1 + (1/256)*(b2 + (1/256)*b3)))
which, in turn, is pretty much what you wrote (for some abstract N)
As Severin Pappadeux says, why not just do something like
const double factor = 2.32830643653869628906e-10; // 2^(-32)
unsigned int accumulator = 0;
for (int i = 0; i != 4; ++i)
{
    accumulator <<= 8;               // make room for the next byte
    accumulator |= getRandomByte();  // append 8 fresh random bits
}
double r = factor * accumulator;
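The Horner form above can also be written without any bit operations. The following Python sketch is only an illustration: bytes_to_unit_float is a hypothetical name, and the input is assumed to be a sequence of byte values with the most significant byte first.

```python
def bytes_to_unit_float(random_bytes):
    """Map a sequence of byte values (most significant first) to a float in [0, 1)."""
    state = 0.0
    # Horner scheme: start from the least significant byte and work outward,
    # dividing by 256 at each step.
    for b in reversed(random_bytes):
        state = (state + b) / 256.0
    return state
```

With four bytes every intermediate value is exactly representable in a double, so the result is simply the 32-bit integer divided by 2^32. Feeding in more bytes than the mantissa can hold means the last steps round, and an input of all 0xFF bytes can then round up to exactly 1.0.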

How to compute the integer absolute value

How to compute the integer absolute value without using if condition.
I guess we need to use some bitwise operation.
Can anybody help?
Same as existing answers, but with more explanations:
Let's assume a twos-complement number (as it's the usual case and you don't say otherwise) and let's assume 32-bit:
First, we perform an arithmetic right shift by 31 bits, which shifts in all 1s for a negative number or all 0s for a positive one. Note that the behaviour of the actual >> operator in C or C++ is implementation-defined for negative numbers, although it will usually perform an arithmetic shift; let's just assume pseudocode or actual hardware instructions, since it sounds like homework anyway:
mask = x >> 31;
So what we get is 111...111 (-1) for negative numbers and 000...000 (0) for positives
Now we XOR this with x, getting the behaviour of a NOT for mask=111...111 (negative) and a no-op for mask=000...000 (positive):
x = x XOR mask;
And finally subtract our mask, which means +1 for negatives and +0/no-op for positives:
x = x - mask;
So for positives we perform an XOR with 0 and a subtraction of 0 and thus get the same number. And for negatives, we got (NOT x) + 1, which is exactly -x when using twos-complement representation.
Set the mask as right shift of integer by 31 (assuming integers are stored as two's-complement 32-bit values and that the right-shift operator does sign extension).
mask = n>>31
XOR the mask with number
mask ^ n
Subtract mask from result of step 2 and return the result.
(mask^n) - mask
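The three steps above can be tried out in Python, as a rough sketch. Python integers are unbounded and its >> sign-extends, so the 32-bit wrap-around is emulated explicitly; to_int32 and abs32 are hypothetical helper names, and inputs are assumed to lie in the signed 32-bit range.

```python
def to_int32(v):
    """Wrap a Python int to a signed 32-bit two's-complement value."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v


def abs32(n):
    """Branch-free absolute value of a signed 32-bit integer."""
    mask = n >> 31                      # -1 for negatives, 0 otherwise
    return to_int32((n ^ mask) - mask)  # (NOT n) + 1 for negatives, n otherwise
```

For the most negative 32-bit value the emulated result wraps around to the input itself, because +2^31 is not representable in 32 bits.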
Assume int is 32 bits wide.
int my_abs(int x)
{
    int y = (x >> 31);   /* all 1s if x is negative, otherwise 0 */
    return (x ^ y) - y;
}
One can also perform the above operation as:
return n*(((n>0)<<1)-1);
where n is the number whose absolute value needs to be calculated.
In C, you can use unions to perform bit manipulations on doubles. The following will work in C and can be used for both integers, floats, and doubles.
#include <stdint.h>

/**
 * Calculates the absolute value of a double.
 * @param x An 8-byte floating-point double
 * @return A positive double
 * @note Uses bit manipulation and does not care about NaNs
 */
double abs(double x)
{
    union {
        uint64_t bits;
        double dub;
    } b;
    b.dub = x;
    //Sets the sign bit to 0
    b.bits &= 0x7FFFFFFFFFFFFFFF;
    return b.dub;
}
Note that this assumes that doubles are 8 bytes.
I wrote my own, before discovering this question.
My answer is probably slower, but still valid:
int abs_of_x = ((x*(x >> 31)) | ((~x + 1) * ((~x + 1) >> 31)));
If you are not allowed to use the minus sign you could do something like this:
int absVal(int x) {
    return ((x >> 31) + x) ^ (x >> 31);
}
For assembly the most efficient would be to initialize a value to 0, subtract the integer, and then take the max:
pxor mm1, mm1 ; set mm1 to all zeros
psubw mm1, mm0 ; make each mm1 word contain the negative of each mm0 word
pmaxsw mm1, mm0 ; mm1 will contain only the positive (larger) values - the absolute value
In C#, you can implement abs() without using any local variables:
public static long abs(long d) => (d + (d >>= 63)) ^ d;
public static int abs(int d) => (d + (d >>= 31)) ^ d;
Note: regarding 0x80000000 (int.MinValue) and 0x8000000000000000 (long.MinValue):
As with all of the other bitwise/non-branching methods shown on this page, this gives the single non-mathematical result abs(int.MinValue) == int.MinValue (likewise for long.MinValue). These represent the only cases where result value is negative, that is, where the MSB of the two's-complement result is 1 -- and are also the only cases where the input value is returned unchanged. I don't believe this important point was mentioned elsewhere on this page.
The code shown above depends on the value of d used on the right side of the xor being the value of d updated during the computation of the left side. To C# programmers this will seem obvious, because C# guarantees that operands are evaluated strictly from left to right. The reason I mention this is that in C or C++ one needs to be more cautious: those languages do not sequence the modification of d relative to the other reads of d in the same expression, so the equivalent expression has undefined behavior there and a compiler may evaluate it in any order.
If you don't want to rely on implementation of sign extension while right bit shifting, you can modify the way you calculate the mask:
mask = ~((n >> 31) & 1) + 1
then proceed as was already demonstrated in the previous answers:
(n ^ mask) - mask
What is the programming language you're using? In C# you can use the Math.Abs method:
int value1 = -1000;
int value2 = 20;
int abs1 = Math.Abs(value1);
int abs2 = Math.Abs(value2);

How to implement square root and exponentiation on arbitrary length numbers?

I'm working on new data type for arbitrary length numbers (only non-negative integers) and I got stuck at implementing square root and exponentiation functions (only for natural exponents). Please help.
I store the arbitrary length number as a string, so all operations are made char by char.
Please don't include advices to use different (existing) library or other way to store the number than string. It's meant to be a programming exercise, not a real-world application, so optimization and performance are not so necessary.
If you include code in your answer, I would prefer it to be in either pseudo-code or in C++. The important thing is the algorithm, not the implementation itself.
Thanks for the help.
Square root: Babylonian method. I.e.
function sqrt(N):
    oldguess = -1
    guess = 1
    while abs(guess-oldguess) > 1:
        oldguess = guess
        guess = (guess + N/guess) / 2
    return guess
Exponentiation: by squaring.
function exp(base, pow):
    result = 1
    bits = toBinary(pow)
    for bit in bits:
        result = result * result
        if (bit):
            result = result * base
    return result
where toBinary returns a list/array of 1s and 0s, MSB first, for instance as implemented by this Python function:
def toBinary(x):
    return map(lambda b: 1 if b == '1' else 0, bin(x)[2:])
Note that if your implementation is done using binary numbers, this can be implemented using bitwise operations without needing any extra memory. If using decimal, then you will need the extra to store the binary encoding.
However, there is a decimal version of the algorithm, which looks something like this:
function exp(base, pow):
    lookup = [1, base, base*base, base*base*base, ...] #...up to base^9
    #The above line can be optimised using exp-by-squaring if desired
    result = 1
    digits = toDecimal(pow)
    for digit in digits:
        r2 = result * result
        r4 = r2 * r2
        result = r4 * r4 * r2 * lookup[digit] #result^10 * base^digit
    return result
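As a hedged illustration of the decimal variant, here is a small Python sketch. The name exp_decimal is made up, and Python's built-in integers stand in for the string-based number type. When the decimal digits of the exponent are processed most significant first, each step has to raise the running result to the 10th power before multiplying in base^digit.

```python
def exp_decimal(base, power):
    """Compute base ** power by scanning the decimal digits of power, MSB first."""
    # lookup[d] == base ** d for d in 0..9
    lookup = [1]
    for _ in range(9):
        lookup.append(lookup[-1] * base)

    result = 1
    for ch in str(power):
        digit = int(ch)
        # result ** 10 using four multiplications: r2, r4, r8, then r8 * r2
        r2 = result * result
        r4 = r2 * r2
        r8 = r4 * r4
        result = r8 * r2 * lookup[digit]
    return result
```

Each decimal digit costs five multiplications here, plus the one-time cost of building the lookup table.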
Exponentiation is trivially implemented with multiplication - the most basic implementation is just a loop,
result = 1;
for (int i = 0; i < power; ++i) result *= base;
You can (and should) implement a better version using squaring with divide & conquer - i.e. a^5 = a^4 * a = (a^2)^2 * a.
Square root can be found using Newton's method - you have to get an initial guess (a good one is to take a square root from the highest digit, and to multiply that by base of the digits raised to half of the original number's length), and then to refine it using division: if a is an approximation to sqrt(x), then a better approximation is (a + x / a) / 2. You should stop when the next approximation is equal to the previous one, or to x / a.
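A rough Python sketch of the Newton iteration for the integer square root follows. It is not the only way to do it: the name isqrt_newton is hypothetical, the initial guess is taken as a power of two that is at least sqrt(x) instead of the digit-based guess described above, and the loop stops as soon as the estimate no longer decreases, which is a common integer variant of the stopping rule.

```python
def isqrt_newton(x):
    """Return floor(sqrt(x)) for a non-negative integer x using Newton's method."""
    if x < 2:
        return x
    # Power of two that is guaranteed to be >= sqrt(x).
    guess = 1 << ((x.bit_length() + 1) // 2)
    while True:
        better = (guess + x // guess) // 2
        if better >= guess:      # no further decrease: guess is floor(sqrt(x))
            return guess
        guess = better
```

Starting from an overestimate makes the sequence decrease monotonically, which is what allows the simple termination test.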

What's a good way to add a large number of small floats together?

Say you have 100000000 32-bit floating point values in an array, and each of these floats has a value between 0.0 and 1.0. If you tried to sum them all up like this
result = 0.0;
for (i = 0; i < 100000000; i++) {
    result += array[i];
}
you'd run into problems as result gets much larger than 1.0.
So what are some of the ways to more accurately perform the summation?
Sounds like you want to use Kahan Summation.
According to Wikipedia,
The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).
In pseudocode, the algorithm is:
function kahanSum(input)
    var sum = input[1]
    var c = 0.0 //A running compensation for lost low-order bits.
    for i = 2 to input.length
        y = input[i] - c //So far, so good: c is zero.
        t = sum + y //Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y //(t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y)
        sum = t //Algebraically, c should always be zero. Beware eagerly optimising compilers!
    next i //Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
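The pseudocode translates almost line for line into Python. This is a minimal sketch: kahan_sum is an illustrative name, indexing is zero-based, and Python floats are doubles rather than the 32-bit floats from the question, so the effect only shows up at a smaller scale.

```python
def kahan_sum(values):
    """Sum floats with Kahan (compensated) summation."""
    total = 0.0
    c = 0.0                      # running compensation for lost low-order bits
    for x in values:
        y = x - c                # apply the correction from the previous step
        t = total + y            # low-order bits of y may be lost here
        c = (t - total) - y      # recover (the negative of) what was lost
        total = t
    return total
```

For example, adding ten thousand copies of 1e-16 to 1.0 one at a time with plain += leaves the total at exactly 1.0, whereas the compensated version ends up close to 1 + 1e-12.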
Make result a double, assuming C or C++.
If you can tolerate a little extra space (in Java):
float[] temp = new float[1000000];
float[] temp2 = new float[1000];
float sum = 0.0f;
// assumes array holds 1,000,000,000 values
for (int i=0 ; i<1000000000 ; i++) temp[i/1000] += array[i];
for (int i=0 ; i<1000000 ; i++) temp2[i/1000] += temp[i];
for (int i=0 ; i<1000 ; i++) sum += temp2[i];
Standard divide-and-conquer algorithm, basically. This only works if the numbers are randomly scattered; it won't work if the first half billion numbers are 1e-12 and the second half billion are much larger.
But before doing any of that, one might just accumulate the result in a double. That'll help a lot.
In .NET, you can use the LINQ .Sum() extension method that exists on IEnumerable. Then it would just be:
var result = array.Sum();
The absolutely optimal way is to use a priority queue, in the following way:
PriorityQueue<Float> q = new PriorityQueue<Float>();
for(float x : list) q.add(x);
while(q.size() > 1) q.add(q.poll() + q.poll());
return q.poll();
(this code assumes the numbers are positive; generally the queue should be ordered by absolute value)
Explanation: given a list of numbers, to add them up as precisely as possible you should strive to make the numbers close, i.e. eliminate the difference between small and big ones. That's why you want to add up the two smallest numbers, thus increasing the minimal value of the list, decreasing the difference between the minimum and maximum in the list and reducing the problem size by 1.
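The same idea can be sketched in Python with the standard heapq module. The function name sum_smallest_first is made up for illustration, and, like the Java snippet, the sketch assumes non-negative inputs.

```python
import heapq


def sum_smallest_first(values):
    """Sum non-negative floats by repeatedly adding the two smallest values."""
    heap = list(values)
    if not heap:
        return 0.0
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)          # smallest remaining value
        b = heapq.heappop(heap)          # second smallest
        heapq.heappush(heap, a + b)      # partial sum goes back into the queue
    return heap[0]
```

Each step costs O(log n) heap operations, so this trades speed for accuracy compared with a plain loop.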
Unfortunately I have no idea about how this can be vectorized, considering that you're using OpenCL. But I am almost sure that it can be. You might take a look at the book on vector algorithms, it is surprising how powerful they actually are: Vector Models for Data-Parallel Computing
