How to define generic reductions using any binary operator (e.g. OR or AND) or a custom function in Halide?

I believe Halide currently provides sum, minimum, maximum and product, which work with an RDom. I would like to write a function that performs reductions over custom binary operations, e.g. logical AND (&&), logical OR (||), bitwise AND (&), bitwise OR (|), etc. How can I do this in Halide?
Here are my thoughts on this problem:
Suppose we have a uint8_t input and want to perform a reduction using (|):
RDom rw(0, width), rh(0, height);
Var x, y;
Func f, g;
f(y) = cast<uint8_t>(0);
f(y) = f(y) | input(rw, y);   // OR-reduce each row over x
g(x) = cast<uint8_t>(0);
g(x) = g(x) | f(rh);          // OR-reduce the per-row results over y
It would be nice to have a Halide helper that can perform a generic reduction given a reduction function of two inputs.
Thanks in advance for your reply.

The helpers sum, product, etc. are not actually built-ins. They are just helpers written in the Halide front-end language itself, so you can define more if you like, or make a generic one that takes a binary operator. I would start by looking at https://github.com/halide/Halide/blob/master/src/InlineReductions.cpp
The bulk of the magic is the automatic capturing of free variables; the rest is just like the code you wrote.
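As a rough sketch of that idea (not part of Halide itself; the helper name reduce_rows and the lambda-based operator are assumptions for illustration), a user-level helper could take the binary operator and identity as parameters:

#include "Halide.h"
using namespace Halide;

// Hypothetical helper: fold input(rw, y) over the reduction domain rw with a
// caller-supplied binary operator, producing a 1-D Func over y.
template<typename Op>
Func reduce_rows(Func input, RDom rw, Var y, Expr identity, Op op) {
    Func f("reduce_rows");
    f(y) = identity;                  // pure definition: the identity element
    f(y) = op(f(y), input(rw, y));    // update definition iterates over rw
    return f;
}

// Usage, mirroring the code in the question:
// RDom rw(0, width);
// Var y;
// Func rowwise_or = reduce_rows(input, rw, y, cast<uint8_t>(0),
//                               [](Expr a, Expr b) { return a | b; });

Unlike the real inline reduction helpers, this sketch does not capture free variables automatically; you still pass the Var and RDom explicitly.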

Related

JAX minimization with stochastically estimated gradients

I'm trying to use the BFGS optimizer from tensorflow_probability.substrates.jax and from jax.scipy.optimize.minimize to minimize a function f which is estimated from pseudo-random samples and has a jax.random.PRNGKey as an argument. To use this function with the jax/tfp BFGS minimizer, I wrap the function inside a lambda:
seed = 100
key = jax.random.PRNGKey(seed)
fun = lambda x: f(x, key)
result = jax.scipy.optimize.minimize(fun = fun, ...)
What is the best way to update the key when the minimization routine calls the function to be minimized so that I use different pseudo-random numbers in a reproducible way? Maybe a global key variable? If yes, is there an example I could follow?
Secondly, is there a way to make the optimization stop after a certain amount of time, as one could do with a callback in scipy? I could directly use the scipy implementation of BFGS/L-BFGS-B/etc. and use jax only for the estimation of the function and of its gradients, which seems to work. Is there a difference between the scipy, jax.scipy and tfp.jax BFGS implementations?
Finally, is there a way to print the values of the arguments of fun during the bfgs optimization in jax.scipy or tfp, given that f is jitted?
Thank you!
There is no way to do what you're asking with jax.scipy.optimize.minimize, because the minimizer does not offer any means to track changing state between function calls, and does not provide for any inbuilt stochasticity in the optimizer.
If you're interested in stochastic optimization in JAX, you might try stochastic optimization in JAXOpt, which provides a much more flexible set of optimization routines.
Regarding your second question, if you'd like to print values during the course of a jit-compiled optimization or other loop, you can use jax.debug.print.
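As a minimal sketch of that suggestion (the toy objective f below is an assumption standing in for your stochastically estimated function), jax.debug.print can be dropped into the objective even though the minimizer traces and jit-compiles it:

import jax
import jax.numpy as jnp
import jax.scipy.optimize

key = jax.random.PRNGKey(100)

def f(x, key):
    # Toy noisy objective, standing in for the function estimated from samples.
    noise = jax.random.normal(key, x.shape)
    return jnp.sum((x - 1.0) ** 2) + 1e-3 * jnp.dot(noise, x)

def fun(x):
    # jax.debug.print works inside jit-compiled code, unlike a plain print.
    jax.debug.print("x = {x}, f(x) = {fx}", x=x, fx=f(x, key))
    return f(x, key)

result = jax.scipy.optimize.minimize(fun, jnp.zeros(3), method="BFGS")
print(result.x, result.fun)

Note that the key captured by the closure stays fixed for the whole run, which is exactly the limitation described above.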

How can I specialize low-level functions for performance while keeping high-level functions polymorphic?

I extracted the following minimal example from my production project. My machine learning project is made up of a linear algebra library, a deep learning library, and an application.
The linear algebra library contains a module for matrices based on storable vectors:
module Matrix where
import Data.Vector.Storable hiding (sum)
data Matrix a = Matrix { rows :: Int, cols :: Int, items :: Vector a } deriving (Eq, Show, Read)
item :: Storable a => Int -> Int -> Matrix a -> a
item i j m = unsafeIndex (items m) $ i * cols m + j
multiply :: Storable a => Num a => Matrix a -> Matrix a -> Matrix a
multiply :: Storable a => Num a => Matrix a -> Matrix a -> Matrix a
multiply a b = Matrix (rows a) (cols b) $ generate (rows a * cols b) (f . flip divMod (cols b)) where
  f (i, j) = sum $ (\ k -> item i k a * item k j b) <$> [0 .. cols a - 1]
The deep learning library uses the linear algebra library to implement the forward pass through a deep neural network:
module Deep where
import Foreign.Storable
import Matrix
transform :: Storable a => Num a => [Matrix a] -> Matrix a -> Matrix a
transform layers batch = foldr multiply batch layers
And finally the application uses the deep learning library:
import qualified Data.Vector.Storable as VS
import Test.Tasty.Bench
import Matrix
import Deep
main :: IO ()
main = defaultMain [bmultiply] where
  bmultiply = bench "bmultiply" $ nf (items . transform layers) batch where
    m k l c = Matrix k l $ VS.replicate (k * l) c :: Matrix Double
    layers = m 256 256 <$> [0.1, 0.2, 0.3]
    batch = m 256 100 0.4
I like the fact that the deep learning library, and (with some exceptions related to BLAS via FFI) also the linear algebra library, do not have to worry about concrete types like Float or Double. Unfortunately, this also means that unless specialization takes place, they use boxed values and performance is about 60x worse than it could be (959 ms instead of 16.7 ms).
The only way I have found to get good performance is to force either inlining or specialization throughout the entire call hierarchy via compiler pragmas. This is very annoying because the performance issue that fundamentally should be specific to the multiply function now "infects" the entire code base. Even very high-level functions using multiply via 5 levels of indirection and several intermediate libraries somehow have to "know" about technical specialization issues deep down.
In my actual production code, many more functions are affected than in this minimal example. Forgetting to annotate just a single one of these functions with the right compiler pragma immediately destroys the performance. Additionally, when developing a library, I have no way of knowing which types it will be used with, so specialization pragmas are not an option anyway.
This is particularly unfortunate because all the performance-critical tight loops are wholly contained within the multiply function. The function itself is only called a handful of times and it would not hurt performance if values were only unboxed dynamically whenever multiply is called. In the end, there is really no need for values to be specialized and unboxed inside the high-level machine learning functions. I feel like there should be a way to pass the request for specialization through to the low-level functions while keeping high- and intermediate-level functions polymorphic.
How is this problem typically solved in Haskell? If I develop a library that uses the vector package to generate blazingly-fast code in tight loops, how do I pass that performance on to users of my library without losing all polymorphism or forcing everything to be inlined?
Is there a way to pay the price for polymorphism (in the form of boxing) only within the high-level functions and specialize and unbox only at the boundary to the functions that need it, rather than having specialization "infect" the entire call hierarchy?
If you browse the source for, say, the vector package, you'll find that nearly every function has an INLINABLE or INLINE pragma, whether the function is part of the low-level, performance critical core or part of a high-level generic interface. You'll see something similar if you look at lens or hmatrix, etc.
So, the short answer is: no, the only way to get good performance with your current design will be to infect the entire call hierarchy with pragmas. The best way to avoid missing a pragma and tanking performance will be to have an exhaustive set of benchmarks that can detect performance regressions.
There are a few compiler flags that might be helpful. The flag -fexpose-all-unfoldings makes sure that inlinable versions of all functions find their way into the interface files, while the flag -fspecialise-aggressively looks for any opportunity to specialize those functions. Together, they are kind of like turning on INLINE for every function. This probably isn't a good permanent solution, but it might be useful during development or as a sanity check to get some baseline performance numbers.
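For the code above, the pragma approach would look roughly like this sketch (module placement as indicated in the comments): mark each polymorphic function INLINABLE so its unfolding is kept in the interface file, and optionally request a specialisation at the concrete type from the application.

-- in Matrix.hs
{-# INLINABLE item #-}
{-# INLINABLE multiply #-}

-- in Deep.hs
{-# INLINABLE transform #-}

-- in the application (or rely on -fexpose-all-unfoldings / -fspecialise-aggressively):
{-# SPECIALIZE transform :: [Matrix Double] -> Matrix Double -> Matrix Double #-}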

How to count all the permutations of a string in Microsoft Small Basic?

How to count all the permutations of a string in Microsoft Small Basic?
The brute-force exploration of all permutations is usually done by recursion in languages such as C and C++. However, Microsoft Small Basic doesn't support arguments for its subroutines, so it's impossible to implement the recursive algorithm the same way.
Perhaps it's doable in Small Basic using the Stack? How exactly?
You can't use arguments in functions in Small Basic, but since all variables are global, you can simply set them before you call the function and use them inside it. It is also possible for a function to call itself. This means that you can use functions (or subroutines, as they are called in SB) to 'brute force' this algorithm.
See here:
Sub Printx
  TextWindow.WriteLine(x)
  x = x + 1
  Printx()
EndSub
x = 1
Printx()
Note that this way of doing things may crash the program after around 2,000 'recalls' of the subroutine, as it throws a stack overflow error.
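Applying the same global-variable trick to the counting itself: if all characters in the string are distinct, the number of permutations is simply n!, which a self-calling subroutine can compute. A sketch (variable names are arbitrary; for repeated characters you would additionally divide by the factorial of each character's count):

s = "abcd"
n = Text.GetLength(s)
Factorial()
TextWindow.WriteLine("Permutations: " + result)

Sub Factorial
  If n <= 1 Then
    result = 1
  Else
    n = n - 1
    Factorial()       ' result now holds (n-1)!
    n = n + 1
    result = result * n
  EndIf
EndSub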

construct a structured matrix efficiently in fortran

Having left Fortran for several years, now I have to pick it up and start to work with it again.
I'd like to construct a matrix with entry(i,j) in the form f(x_i,y_j), where f is a function of two variables, e.g., f(x,y)=cos(x-y). In Matlab or Python(Numpy), there are efficient ways to handle this kind of specific issue. I wonder whether there is such optimization in Fortran.
BTW, is it also true in Fortran that a vectorized operation is faster than a do/for loop (as is the case in Matlab and Numpy) ?
If by vectorized you mean the same thing as in Matlab and Python, i.e. the short forms you call on a whole array, then no, these forms are often slower, because they may be harder to optimize than simple loops. What is faster is when the compiler actually uses the vector instructions of the CPU, but that is something else, and it is easier for the compiler to use them for simple loops.
Fortran has elemental functions, do concurrent, forall and where constructs, implied loops and array constructors. There is no point repeating them here, they have been described many times on this site or in tutorials.
Your example is most simply done using a loop:
do j = 1, ny
   do i = 1, nx
      entry(i, j) = f(x(i), y(j))
   end do
end do
One of the short forms, which you probably meant by Python-like vectorization, would be whole-array operations, e.g.,
A = cos(B)
C = A * B
D = f(A*B)
and similar. The function (which is called on each element of the array) must be elemental. These operations are not necessarily efficient. For example, the last call may require a temporary array to be created, which would be avoided when using a loop.
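For completeness, the closest analogue of the Matlab/NumPy broadcasting style for the original example f(x,y) = cos(x-y) would be something like the following sketch (array names are illustrative; the spread calls may create temporaries, so it is not necessarily faster than the loop above):

program build_matrix
  implicit none
  integer, parameter :: nx = 3, ny = 4
  integer :: i
  real :: x(nx), y(ny), a(nx, ny)

  x = [(real(i), i = 1, nx)]
  y = [(0.5 * real(i), i = 1, ny)]

  ! a(i,j) = cos(x(i) - y(j)) without an explicit loop:
  ! spread(x, 2, ny) replicates x across columns, spread(y, 1, nx) across rows.
  a = cos(spread(x, 2, ny) - spread(y, 1, nx))

  print *, a
end program build_matrix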

Derivative of a program

Let us assume you can represent a program as a mathematical function; that's possible. What does the program representation of the first derivative of that function look like? Is there a way to transform a program into its "derivative" form, and does this make sense at all?
Yes, it does make sense: it's known as Automatic Differentiation. There are one or two experimental compilers which can do this, for example NAGware's Differentiation Enabled Fortran Compiler Technology. And there are a lot of research papers on the topic. I suggest you get Googling.
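To give a feel for the forward mode of automatic differentiation, here is a tiny JavaScript sketch using dual numbers (all names are made up for illustration and not taken from any particular AD library):

// A dual number carries a value together with its derivative, and every
// arithmetic operation propagates both.
function dual(val, der) { return { val: val, der: der }; }
function add(a, b) { return dual(a.val + b.val, a.der + b.der); }
function mul(a, b) { return dual(a.val * b.val, a.der * b.val + a.val * b.der); }

// f(x) = x*x + 3x, written against the dual-number operations.
function f(x) { return add(mul(x, x), mul(dual(3, 0), x)); }

// Seeding the derivative of x with 1 yields the exact derivative alongside the value:
var r = f(dual(2, 1));   // r.val === 10, r.der === 7 (since f'(x) = 2x + 3)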
First, it only makes sense to try to get the derivative of a pure function (one that does not affect external state and returns the exact same output for every input). Second, the type system of many programming languages involves a lot of step functions (e.g. integers), meaning you'd have to get your program to work in terms of continuous functions in order to get a valid first derivative. Third, getting the derivative of any function involves breaking it down and manipulating it symbolically. Thus, you can't get the derivative of a function without knowing what operations it is made of. This could be achieved with reflection.
You could create a derivative approximation function if your programming language supports closures (that is, nested functions and the ability to put functions into variables and return them). Here is a JavaScript example taken from http://en.wikipedia.org/wiki/Closure_%28computer_science%29 :
function derivative(f, dx) {
  return function(x) {
    return (f(x + dx) - f(x)) / dx;
  };
}
Thus, you could say:
function f(x) { return x*x; }
f_prime = derivative(f, 0.0001);
Here, f_prime will approximate function(x) {return 2*x;}
If a programming language implemented higher-order functions and enough algebra, one could implement a real derivative function in it. That would be really cool.
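A very small taste of what that might look like, sketched in JavaScript with expressions encoded as nested arrays (purely illustrative; no simplification is performed):

// diff returns the derivative of an expression with respect to "x".
function diff(e) {
  if (e === "x") return 1;                 // d/dx x = 1
  if (typeof e === "number") return 0;     // constants differentiate to 0
  if (e[0] === "+") return ["+", diff(e[1]), diff(e[2])];
  if (e[0] === "*") return ["+", ["*", diff(e[1]), e[2]],
                                 ["*", e[1], diff(e[2])]];   // product rule
}

// d/dx (x * x)  ->  ["+", ["*", 1, "x"], ["*", "x", 1]], i.e. an unsimplified 2x
var d = diff(["*", "x", "x"]);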
See Lambda the Ultimate discussions on Derivatives and dissections of data types and Derivatives of Regular Expressions
How do you define the mathematical function of a program?
A derivative represents the rate of change of a function. If your function isn't continuous, its derivative will be undefined over most of the domain.
I'm just going to say that this doesn't make a lot of sense, as a program is much more abstract and "ruleless" than a mathematical function. Since a derivative is a measure of the change in output as the input changes, there are certainly some programs where this could apply. However, you'd need to be able to quantify both your input and output in numerical terms.
Since input and output would both be numerical, it's reasonable to assume that your program represents or operates similarly to a mathematical function, or series of functions. Hence, you can easily represent a derivative, but it would be no different than converting the mathematical derivative of a function into a computer program.
If the program is denoted as a distribution (in the sense of Schwartz), then you have some notion of derivative, assuming that the test functions model your postcondition (you can still take the limit to get a characteristic function). For instance, the assignment x := x + 1 is associated with the Dirac distribution \delta_{x_0+1}, where x_0 is the initial value of the variable x. However, I have no idea what the computational meaning of \delta_{x_0+1}' is.
I am wondering: what if the program you're trying to "derive" uses some form of heuristics? How can it be derived then?
Half-jokingly, we all know that all real programs use at least a rand().
