Is it possible to create an ImageParam with a predefined size for a Halide pipeline?

Currently, I am trying to cross compile a Halide pipeline. The arguments for the pipeline were initially
ImageParam ifm(type_of<float>(), 4, "ifm");
ImageParam kernel(type_of<float>(), 4, "kernel");
ImageParam bias(type_of<float>(), 1, "bias");
Param<int> stride_x;
Param<int> stride_y;
Param<int> pad_x;
Param<int> pad_y;
vector<Argument> args(7);
args[0] = ifm;
args[1] = kernel;
args[2] = bias;
args[3] = pad_x;
args[4] = pad_y;
args[5] = stride_x;
args[6] = stride_y;
After successfully compiling and testing the function, I found that the function call was taking a significant amount of time for each run. I tried to minimize the number of arguments to speed up the call, and it worked. Currently, the only argument being passed to the function is the ifm. (I am trying to perform convolution.)
ImageParam ifm(type_of<float>(), 4, "ifm");
vector<Argument> args(1);
args[0] = ifm;
It took a lot less time for each pipeline call. I figured that Halide might be JIT compiling the pipeline before running it, and that this adds the overhead, since fewer arguments result in fewer dimension resolutions at run time. (Not sure, can somebody confirm?)
To fully statically compile the pipeline, I think the ImageParam could be constructed with a predefined size, as I already know the size of the image being passed to the pipeline. I looked at the ImageParam class but found no constructor that would suit my needs. Is there any workaround?
P.S. Is there any other way of cross compiling Halide pipelines? I have checked generators, but I think they have the same shortcoming.

There is no constructor that allows you to set the size of the input. However, once your ImageParam is constructed, you can use the ImageParam::dim() method to set the dimensions. For example, in your case, if the input sizes are stored in a vector named dims you could call
for (int i = 0; i < 4; ++i) {
    ifm.dim(i).set_bounds(0, dims[i]);
}
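If you also know the layout of the incoming buffer, you can pin the strides in the same way, which lets Halide specialize the generated code further. A rough sketch, assuming a dense planar layout and hypothetical extents (the values below are mine, not from the question):

std::vector<int> dims = {224, 224, 3, 1};  // hypothetical extents
int stride = 1;
for (int i = 0; i < 4; ++i) {
    ifm.dim(i).set_bounds(0, dims[i]);  // min = 0, extent = dims[i]
    ifm.dim(i).set_stride(stride);      // dense layout: stride of dim 0 is 1
    stride *= dims[i];
}

With bounds and strides fixed, the AOT-compiled pipeline should no longer need to resolve dimensions at run time; it will just check that the buffer you pass in matches the constraints.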

Related

Eigen3 : dynamic matrix along one dimension only : failure to construct

I can easily define a matrix type with a fixed number of rows and an undefined number of columns:
typedef Matrix<double, 6, Dynamic> dMat6rowsNCols;
dMat6rowsNCols M1;
But I couldn't figure out how to instantiate it; these attempts don't compile:
M1.Zero(4);
M1 = dMat6rowsNCols::Zero(4);
May I ask you for some hints?
Cheers
Sylvain
From the documentation of DenseBase::Zero(Index):
This is only for vectors (either row-vectors or column-vectors), i.e. matrices which are known at compile-time to have either one row or one column.
You have to use DenseBase::Zero(Index, Index):
M1 = dMat6rowsNCols::Zero(6, 4);
If you are using the current Eigen trunk you can use Eigen::NoChange for some functions, e.g.
M1.setZero(Eigen::NoChange, 4);
As a side note: Calling M1.Zero does not call a member function that sets all coefficients to 0; it is just a less common way of calling the static function DenseBase::Zero. You are probably looking for M1.setZero.
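Putting it together, a minimal sketch (the typedef is taken from the question; the sizes are arbitrary):

#include <Eigen/Dense>
using namespace Eigen;

typedef Matrix<double, 6, Dynamic> dMat6rowsNCols;

int main() {
    // The row count is fixed at 6 by the type, so only the column count varies.
    dMat6rowsNCols M1 = dMat6rowsNCols::Zero(6, 4);
    M1.setZero();      // member function: zeroes the existing 6x4 coefficients
    M1.resize(6, 8);   // resizing must keep the fixed row count of 6
    M1.setZero();
    return 0;
}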

How can I pass multiple parameters to a parallel operation in Octave?

I wrote a function that acts on each combination of columns in an input matrix. It uses multiple for loops and is very slow, so I am trying to parallelize it to use the maximum number of threads on my computer.
I am having difficulty finding the correct syntax to set this up. I'm using the Parallel package in Octave, and have tried several ways to set up the calls. Here are two of them, in a simplified form, as well as a non-parallel version that I believe works:
function A = parallelExample(M)
  pkg load parallel;
  # Get total count of columns
  ct = columns(M);
  # Generate column pairs
  I = nchoosek([1:ct], 2);
  ops = rows(I);
  slice = ones(1, ops);
  Ic = mat2cell(I, slice, 2);
  ## # Non-parallel
  ## A = zeros(1, ops);
  ## for i = 1:ops
  ##   A(i) = cmbtest(Ic{i}, M);
  ## endfor
  # Parallelized call v1
  A = parcellfun(nproc, @cmbtest, Ic, {M});
  ## # Parallelized call v2
  ## afun = @(x) cmbtest(x, M);
  ## A = parcellfun(nproc, afun, Ic);
endfunction
# function to apply
function P = cmbtest(indices, matrix)
  colset = matrix(:, indices);
  product = colset(:, 1) .* colset(:, 2);
  P = sum(product);
endfunction
For both of these examples I generate every combination of two columns and convert those pairs into a cell array that the parcellfun function should split up. In the first, I attempt to convert the input matrix M into a 1x1 cell array so it goes to each parallel instance in the same form. I get the error 'C must be a cell array' but this must be internal to the parcellfun function. In the second, I attempt to define an anonymous function that includes the matrix. The error I get here specifies that 'cmbtest' is undefined.
(Naturally, the actual function I'm trying to apply is far more complex than cmbtest here)
Other things I have tried:
Put M into a global variable so it doesn't need to be passed. Seemed to be impossible to put a global variable in a function file, though I may just be having syntax issues.
Make cmbtest a nested function so it can access M (parcellfun doesn't support that)
I'm out of ideas at this point and could use help figuring out how to get this to work.
Converting my comments above to an answer.
When performing parallel operations, it is useful to think of each parallel worker as a separate and independent Octave instance, which needs access to all of the functions and variables it will require in order to do its independent work.
Therefore, do not rely on subfunctions when calling parcellfun from a main function, since this can lead to errors if the worker is unable to access the subfunction directly under the hood.
In this case, separating the subfunction into its own file fixed the problem.

Eigen cast with auto return type - Less efficient than explicit return type?

When casting a vector of integers (i.e. Eigen::VectorXi) to a vector of doubles and then operating on that vector of doubles, the generated assembly is dramatically different if the return type of the cast is auto.
In other words, using:
Eigen::VectorXi int_vec(3);
int_vec << 1, 2, 3;
Eigen::VectorXd dbl_vec = int_vec.cast<double>();
Compared to:
Eigen::VectorXi int_vec(3);
int_vec << 1, 2, 3;
auto dbl_vec = int_vec.cast<double>();
Here are two examples on godbolt:
VectorXd return type: https://godbolt.org/z/0FLC4r
auto return type: https://godbolt.org/z/MGxCaL
What are the ramifications of using auto for the return here? I thought it would be more efficient by avoiding a copy, but now I'm not sure.
Indeed, in the code in your question you avoid a copy (until dbl_vec is used, it's essentially a no-op). However, in the code on godbolt, you traverse the original int_vec and evaluate dbl_vec at least twice, possibly thrice:
max + std::log((dbl_vec.array() - max)
^^^             ^^^^^^^           ^^^
I'm not sure if the two calls to max are collapsed into a temporary or not. I'd hope so.
In any case, kmdreko is right and you should avoid using auto with Eigen unless you know exactly what you're doing. In this case, the auto is an expression template that does not get evaluated until used. If you use it more than once, then it gets evaluated more than once. If the evaluation is expensive, then the savings from not using a copy are lost (with interest) to the additional evaluation times.
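A minimal sketch of the difference (the names here are mine, not taken from the godbolt links): the auto variable holds an unevaluated expression, so every use of it redoes the int-to-double conversion, while the explicit VectorXd forces it to be evaluated exactly once.

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::VectorXi int_vec(3);
    int_vec << 1, 2, 3;

    // Expression template: nothing is converted yet.
    auto lazy = int_vec.cast<double>();
    // Each use below re-runs the cast over int_vec.
    double a = lazy.sum() + lazy.prod();

    // Evaluated once into a concrete vector; later uses just read memory.
    Eigen::VectorXd eager = int_vec.cast<double>();
    double b = eager.sum() + eager.prod();

    std::cout << a << " " << b << "\n";
    return 0;
}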

Halide external function call from generator

I want to implement a simple image processing routine quite similar to Auto Levels, so I need to precalculate thresholds, build a LUT, and then do histogram stretching/normalization by applying the LUT.
But my question is not about the algorithm side, it is about using extern-defined functions, because I need a couple of while loops for the LUT calculation and I think using extern is a good fit for that.
I tried following the examples from the Halide sources and checked this question too.
I use AOT compilation, currently testing on PC (win x64) and aiming for ARM in the future, and have the following generator code:
Var x("x"), y("y");
Func make_a_root{ "make_a_root" };
Buffer<bitType> Lut{256, "lut"};
make_a_root(x, y) = inputY(x, y);
ExternFuncArgument arg = make_a_root;
Func g;
g.define_extern("generateAutoLevelsLut", { arg }, UInt(8), 2, Halide::NameMangling::CPlusPlus);
g.compute_root();
inputY is declared as Input<Buffer<uint8_t>> inputY{ "input_y", 2 };
First I just want to get the call to happen, so the function body does nothing but print (can I define the function in the same cpp file as the generator?):
int generateAutoLevelsLut(halide_buffer_t * input, halide_buffer_t * out)
{
    printf("\nextern call\n");
    return 0;
}
I tried default mangling with extern "C" too.
I never succeeded in getting the print message though, so my question is: why is this happening? Is it just a misunderstanding of some syntax, or is there some problem with calling an extern function from generator code?
EDIT:
Added a use of the extern like out(x,y) = g(x,y) (the result actually has to be used!), and it started making the call. Now struggling with host == NULL. Digging into the bounds inference stuff.
EDIT 2:
I added basic bounds inference handling, and now it does not crash. The next problem I have is: is it possible to call an external function without it directly influencing the output result?
Let me make what I mean concrete.
The generator code looks like following:
Buffer<bitType> lut{256, "lut"};
args[0] = inputY;
args[1] = lut;
g.define_extern("generateAutoLevelsLut", args, { UInt(8) }, 2, Halide::NameMangling::C);
outputY(x, y) = g(x, y); // Call line
g.compute_root();
outputY.compute_root();
The extern function code fills the second input 'lut' with a dummy LUT:
Halide::Runtime::Buffer<uint16_t> im2Buffer(*input2);
Mat im2Mat(Size(im2Buffer.width(), im2Buffer.height()), CV_8U, im2Buffer.data(), im2Buffer.stride(1));
for (int i = 0; i < 256; i++)
    im2Mat.at<uchar>(i) = i;
And if I comment out the 'Call line' in the generator, the call to the extern is optimized away entirely.
I want to make something like:
Func lutRoot;
lutRoot(x) = lut(x); // to convert from Buffer
outputY(x, y) = autoLevelsPrecalcLut(inputY, lutRoot)(x, y);
Here lut would be passed implicitly into the extern and filled there. But it doesn't work, and neither do other variants that avoid modifying the 'output' directly... or maybe this whole approach is wrong?
Any suggestions? Thanks
EDIT 3:
Solved the task by avoiding extern calls, replacing the while loops with an argmin and RDom combo, but the original question about extern still remains.
That should work (or fail with a linker error if it wasn't going to). It's possible the Halide pipeline doesn't think it needs to call your extern function. E.g. does something use the result?
Alternatively, try stderr instead, just in case it's an output stream buffering issue. That extern function definition is likely to cause Halide to error out (because it doesn't reply to the bounds inference query), and erroring out calls abort by default, which would swallow things printed to stdout.
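For reference, the usual shape of an extern function that answers the bounds inference query looks something like this. It is a rough sketch using the halide_buffer_t API; the two-dimensional loop and the "input region equals output region" assumption are mine, not taken from the question:

#include "HalideRuntime.h"
#include <cstdio>

extern "C" int generateAutoLevelsLut(halide_buffer_t *input, halide_buffer_t *out) {
    if (input->is_bounds_query()) {
        // Bounds inference pass: host pointers are null; just report which
        // region of the input is needed for the requested output region.
        for (int i = 0; i < 2; i++) {
            input->dim[i].min = out->dim[i].min;
            input->dim[i].extent = out->dim[i].extent;
        }
        return 0;
    }
    // Real call: host pointers are valid, so do the actual work here.
    printf("\nextern call\n");
    return 0;
}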

Can evaluation of functions happen during compile time?

Consider the below function,
public static int foo(int x){
    return x + 5;
}
Now, let us call it,
int in = /*Input taken from the user*/;
int x = foo(10); // ... (1)
int y = foo(in); // ... (2)
Here, can the compiler change
int x = foo(10); // ... (1)
to
int x = 15; // ... (1)
by evaluating the function call during compile time, since the input to the function is available during compile time?
I understand this is not possible for the call marked (2) because the input is available only during run time.
I do not want to know a way of doing it in any specific language. I would like to know why this can or cannot be a feature of a compiler itself.
C++ does have a mechanism for this:
Have a read up on the 'constexpr' keyword in C++11; it allows compile time evaluation of functions.
There is a limitation: in C++11 the function body must be a single return statement (not multiple lines of code), but it can call other constexpr functions (C++14 does not have this limitation, AFAIK).
static constexpr int foo(int x){
    return x + 5;
}
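To illustrate (a minimal sketch, not part of the original answer), the constexpr version can then be used anywhere a compile time constant is required:

static constexpr int foo(int x){
    return x + 5;
}

static_assert(foo(10) == 15, "evaluated at compile time");
int array[foo(10)];   // the array size must be a constant expression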
EDIT:
Why a compiler might not evaluate a function (just my guess):
It might not be appropriate to remove a function by evaluating it without being told.
The function could be used in different compilation units, and with static/dynamic inputs: thus evaluating it in some circumstances and adding a call in other places.
This use would provide inconsistent execution times (especially on a deterministic platform like AVR) where timing may be important, or at least need to be predictable.
Also interrupts (and how the compiler interacts with them) may come into play here.
EDIT:
constexpr is actually stronger -- in contexts that require a constant expression, it requires that the compiler do this. The compiler is free to fold away functions without constexpr, but the programmer can't rely on it doing so.
Can you give an example of a case where the user would have benefited from this but the compiler chose not to do it?
inline functions may or may not resolve to constant expressions that could be optimized into the end result.
However, constexpr guarantees it. An inline function cannot be used as a compile time constant, whereas constexpr allows you to formulate compile time functions and, beyond that, objects.
A basic example where constexpr makes a guarantee that inline cannot.
constexpr int foo( int a, int b, int c ){
    return a+b+c;
}
int array[ foo(1, 2, 3) ];
And the same as a simple object.
struct Foo{
    constexpr Foo( int a, int b, int c ) : val(a+b+c){}
    int val;
};
constexpr Foo foo( 1,2,4 );
int array[ foo.val ];
Unless foo.val is a compile time constant, the code above will not compile.
Even as just a function, an inline function has no such guarantee. And the linker can also do inlining across multiple compilation units, after the syntax has been compiled (array bounds must already be integer constants when the syntax is checked).
This is kind of like meta-programming, but without the templates. Of course these examples do not do the topic justice, however very complex solutions would benefit from the ability to use objects and functional programming to achieve a result.
Yes, evaluation can happen during compile time. This comes under the heading of constant folding and function inlining, both of which are common optimizations for optimizing compilers.
Many languages do not have a strong distinction between "compile time" and "run time", but the general rule is that the language defines an "execution model" which defines the behavior of any particular program with any particular input (or specifies that it is undefined). The compiler must produce an executable that can read any input and produce the corresponding output as defined by the execution model. What happens inside the executable doesn't matter -- as long as the externally viewed behavior is correct.
Here "input", "output" and "behavior" includes all possible interactions with the environment that are defined in the execution model, including timing effects.