Are there any restrictions with LUTs: "unbounded way in dimension" - Halide

When trying to run the sample code below (similar to a look-up table), it always generates the following error message: "The pure definition of Function 'out' calls function 'color' in an unbounded way in dimension 0".
RDom r(0, 10, 0, 10);
Func label, color, out;
Var x,y,c;
label(x,y) = 0;
label(r.x,r.y) = 1;
color(c) = 0;
color(label(r.x,r.y)) = 255;
out(x,y) = color(label(x,y));
out.realize(10,10);
Before calling realize, I have tried to statically set the bounds, as below, without success.
color.bound(c,0,10);
label.bound(x,0,10).bound(y,0,10);
out.bound(x,0,10).bound(y,0,10);
I also looked at the histogram examples, but they are a bit different.
Is this some kind of restriction in Halide?

Halide prevents any out-of-bounds access (and decides what to compute) by analyzing the range of the values you pass as arguments to a Func. If those values are unbounded, it can't do that. The way to make them bounded is with clamp:
out(x, y) = color(clamp(label(x, y), 0, 9));
In this case, the reason it's unbounded is that label has an update definition, which makes the analysis give up. If you wrote label like this instead:
label(x, y) = select(x >= 0 && x < 10 && y >= 0 && y < 10, 1, 0);
Then you wouldn't need the clamp.
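For reference, here is a minimal, self-contained sketch of the pipeline with both value-dependent indices clamped. The second clamp (on color's update) is my own addition, mirroring what the histogram examples do; treat this as a sketch rather than the canonical fix:

#include "Halide.h"
using namespace Halide;

int main() {
    RDom r(0, 10, 0, 10);
    Func label("label"), color("color"), out("out");
    Var x("x"), y("y"), c("c");

    label(x, y) = 0;
    label(r.x, r.y) = 1;

    color(c) = 0;
    // Clamp the scattered index so the region of color written by the update is bounded.
    color(clamp(label(r.x, r.y), 0, 9)) = 255;

    // Clamp the gathered index so Halide can bound the region of color that out needs.
    out(x, y) = color(clamp(label(x, y), 0, 9));

    Buffer<int32_t> result = out.realize({10, 10});
    return 0;
}

With the clamps in place, the bound() calls from the question should not be necessary.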

Related

Non-associative RDom parallelization in Halide

I am trying to write a decoder for the GPU. My encoding scheme has data dependencies between lines, so when decoding columns of data, each column depends on the previous one. I want to parallelize the internal computation of each column, but execute the columns one by one, sequentially; I am having trouble getting this right.
Below I have modeled a toy example to show the problem:
Func f;
Var x,y;
RDom r(1,3,1,3); // covers (1,1) to (3,3)
f(x,y) = 0;
f(0,y) = y;
Expr p_1 = f(r.x-1,r.y);
Expr p_2 = f(r.x-1,r.y-1);
f(r.x,r.y) = p_1 + p_2;
Buffer<int32_t> output_2D = f.realize({4,4});
A visualization of this program can be seen here: Serial Computation Visualisation
This reduction should give the following array:
int expected_output[4][4] = {{0,0,0,0},
{1,1,1,1},
{2,3,4,5},
{3,5,8,12}};
Checking with Catch2, I can see that it actually calculates this correctly:
for(int j = 0; j < output_2D.height(); j++){
for(int i = 0; i < output_2D.width(); i++){
CAPTURE(i,j);
REQUIRE(expected_output[j][i]==output_2D(i,j));
}
}
My task is to speed this computation up. Since column one depends on column zero, I have to calculate the columns in series. I can, however, calculate all the values within a column in parallel. Please see Computation Steps Parallel and Desired Pipeline for how I want Halide to compute the pipeline.
I tried doing this in Halide using f.update(1).allow_race_conditions().parallel(r.y);, and this does almost what I want.
f(r.x,r.y) = p_1 + p_2;
f.update(1).allow_race_conditions().parallel(r.y);
f.trace_stores();
Buffer<int32_t> output_2D = f.realize({4,4});
For some reason, however, it seems that parallel(r.y) executes the columns in a seemingly random order.
It yields the following store_trace:
Init Image:
Store f29.0(0, 0) = 0
Store f29.0(1, 0) = 0
....
Store f29.0(3, 3) = 0
Init first row:
Store f29.0(0, 0) = 0
Store f29.0(1, 0) = 1
Store f29.0(2, 0) = 2
Store f29.0(3, 0) = 3
Start Parallel Computation:
Store f29.0(1, 1) = 1 // First parallel column
Store f29.0(2, 1) = 1
Store f29.0(3, 1) = 1
Store f29.0(1, 3) = 5 // Second parallel column: THIS IS MY PROBLEM
Store f29.0(2, 3) = 5 // This should be column 2 not column 3.
Store f29.0(3, 3) = 5
Store f29.0(1, 2) = 3
Store f29.0(2, 2) = 4
Store f29.0(3, 2) = 5
A visualization of this pattern can be seen here in this figure: Current Pipeline.
I know that I am explicitly enabling race conditions, so I must be doing something wrong, but I don't know the right way to do this, and this is the closest I have gotten. I could vectorize() with respect to y, and that gives the correct evaluation, but I want to use parallel() to gain a greater speedup for larger matrices/images. rfactor might be a solution, since my problem should be associative in the y direction, but it might not work because it is non-associative in the x direction (each column depends on the previous one). Does anyone know how to be serial in x and parallel in y when using RDoms?
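For what it's worth, one workaround to experiment with (my own sketch, not an answer from the thread) sidesteps the 2D RDom entirely: express one column as a pure function of the previous column, keep the serial loop over columns on the host side, and let Halide parallelize over y inside each column. Since every element of a new column depends only on the already-finished previous column, parallel(y) is race-free and needs no allow_race_conditions():

#include "Halide.h"
using namespace Halide;

int main() {
    const int W = 4, H = 4;

    // prev holds the already-computed column x-1.
    ImageParam prev(Int(32), 1, "prev");
    Var y("y");

    // One column of the recurrence: f(x,y) = f(x-1,y) + f(x-1,y-1) for y >= 1,
    // while row 0 stays at its initial value of 0.
    Func next("next");
    next(y) = select(y > 0, prev(y) + prev(max(y - 1, 0)), 0);
    next.parallel(y);  // all rows of a column are independent

    // Host-side result buffer; column 0 is the boundary condition f(0,y) = y.
    Buffer<int32_t> result(W, H);
    result.fill(0);
    for (int j = 0; j < H; j++) result(0, j) = j;

    // Serial loop over columns; each realize() runs the rows in parallel.
    Buffer<int32_t> prev_col(H), col(H);
    for (int x = 1; x < W; x++) {
        for (int j = 0; j < H; j++) prev_col(j) = result(x - 1, j);
        prev.set(prev_col);
        next.realize(col);
        for (int j = 0; j < H; j++) result(x, j) = col(j);
    }
    return 0;
}

This should reproduce the expected_output table above, at the cost of one pipeline invocation per column; for large images you would amortize that by processing blocks of columns, or keep pursuing an RDom-based schedule (e.g. via rfactor).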

rfactor schedule for descriptor matching

I'm trying to use Halide for brute-force descriptor (e.g. SIFT) matching. I'd like to try rfactor in the schedule, but I can't seem to get the associativity prover to oblige. So far I have the following:
Var c("c"), i("i");
Func diff("diff"), diffSq("diffSq"), dotp("dotp"), out("out"),
inp1("inp1"), inp2("inp2"), minVal("minVal");
inp1(c,x) = input1(c,x);
inp2(c,y) = input2(c,y);
diff(x,y,c) = inp1(c, x) - inp2(c, y);
diffSq(x,y,c) = diff(x,y,c) * diff(x,y,c);
RDom rc(0,128);
dotp(x, y) = 0.f;
dotp(x, y) += diffSq(x, y, rc);
// Argmin, see https://github.com/halide/Halide/blob/master/test/correctness/rfactor.cpp#L804
RDom ry(0, input2.height(), "ry");
minVal(x) = {-1, std::numeric_limits<float>::max()};
minVal(x) = {
select(minVal(x)[1] < dotp(x, ry)
,minVal(x)[0]
,ry),
min(minVal(x)[1], dotp(x, ry))
};
out(x) = minVal(x)[0];
// Schedule
RVar ryo("ryo"), ryi("ryi");
Var yy("yy");
Func intermediate("inter");
dotp.compute_root();
minVal.update(0).split(ry, ryo, ryi, 16);
//intermediate = minVal.update(0).rfactor(ryo, yy);
The last line, when uncommented, sadly fails with:
|| Failed to call rfactor() on minVal.update(0) since it can't prove associativity of the operator
Thanks for any pointers as to how I could resolve this!
Quick answer: only one order of the Tuple elements is matched. Flipping them should allow rfactor. There will be a more complete answer on the list and we'll look at generalizing the matcher. (Answering to make sure the SO side doesn't get forgotten.)
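For concreteness, here is my reading of that answer applied to the code above (a sketch, not verified against the prover): put the running minimum value in Tuple element 0 and the index in element 1, so the update matches the argmin pattern the associativity checker knows.

// (reusing the declarations from the question above)
// Argmin with the Tuple ordered {value, index} instead of {index, value}.
minVal(x) = {std::numeric_limits<float>::max(), -1};
minVal(x) = {
    min(minVal(x)[0], dotp(x, ry)),                       // element 0: running minimum
    select(minVal(x)[0] < dotp(x, ry), minVal(x)[1], ry)  // element 1: index of that minimum
};
out(x) = minVal(x)[1];

// The schedule from the question should then be accepted:
minVal.update(0).split(ry, ryo, ryi, 16);
intermediate = minVal.update(0).rfactor(ryo, yy);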

Least Squares Algorithm doesn't work

I'm trying to code a least-squares algorithm and I've come up with this:
function [y] = ex1_Least_Squares(xValues,yValues,x) % a + b*x + c*x^2 = y
points = size(xValues,1);
A = ones(points,3);
b = zeros(points,1);
for i=1:points
A(i,1) = 1;
A(i,2) = xValues(i);
A(i,3) = xValues(i)^2;
b(i) = yValues(i);
end
constants = (A'*A)\(A'*b);
y = constants(1) + constants(2)*x + constants(3)*x^2;
When I use this MATLAB script for linear functions, it works fine, I think. However, when I pass 12 points of the sin(x) function, I get really bad results.
These are the points I pass to the function:
xValues = [ -180; -144; -108; -72; -36; 0; 36; 72; 108; 144; 160; 180];
yValues = [sind(-180); sind(-144); sind(-108); sind(-72); sind(-36); sind(0); sind(36); sind(72); sind(108); sind(144); sind(160); sind(180) ];
And the result for sin(165°) is 0.559935259380508, when it should be 0.258819.
There is no reason why fitting a parabola to a full period of a sinusoid should give good results. These two curves are unrelated.
MATLAB already contains a least square polynomial fitting function, polyfit and a complementary function, polyval. Although you are probably supposed to write your own, trying out something like the following will be educational:
xValues = [ -180; -144; -108; -72; -36; 0; 36; 72; 108; 144; 160; 180];
% you may want to experiment with different ranges of xValues
yValues = sind(xValues);
% try this with different values of n, say 2, 3, and 4
p = polyfit(xValues,yValues,n);
x = -180:36:180;
y = polyval(p,x);
plot(xValues,yValues);
hold on
plot(x,y,'r');
Also, more generally, you should avoid using loops like the one in your code. This should be equivalent:
points = size(xValues,1);
A = ones(points,3);
A(:,2) = xValues;
A(:,3) = xValues.^2; % .^ and ^ are different
The part of the loop involving b is equivalent to doing b = yValues; either name the incoming variable b or just use yValues directly; there's no need to make a copy of it.
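For completeness, the constants = (A'*A)\(A'*b) line is the textbook normal-equation solution of the least-squares problem (standard linear algebra, nothing specific to this code):

\min_{c}\ \lVert A c - b \rVert_2^2
\quad\Longrightarrow\quad
A^{\top} A\, c = A^{\top} b
\quad\Longrightarrow\quad
c = (A^{\top} A)^{-1} A^{\top} b,
\qquad
A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}

Numerically, solving A\b directly (a QR-based solve in MATLAB) is usually preferable to forming A'*A explicitly, since the latter squares the condition number of the problem.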

Memory and execution speed in MATLAB

I am trying to create random lines and select some of them, which are really rare. My code is rather simple, but to get something usable I need to create very large vectors (i.e. up to 100000000 x 1; the tracks variable in my code). Is there any way to create larger vectors and to reduce the time needed for all those calculations?
My code is
%Initial line values
tracks=input('Give me the number of muon tracks: ');
width=1e-4;
height=2e-4;
Ystart=15.*ones(tracks,1);
Xstart=-40+80.*rand(tracks,1);
%Xend=-40+80.*rand(tracks,1);
Xend=laprnd(tracks,1,Xstart,15);
X=[Xstart';Xend'];
Y=[Ystart';zeros(1,tracks)];
b=(Ystart.*Xend)./(Xend-Xstart);
hot=0;
cold=0;
for i=1:tracks
if ((Xend(i,1)<width/2 && Xend(i,1)>-width/2)||(b(i,1)<height && b(i,1)>0))
plot(X(:, i),Y(:, i),'r');%the chosen ones!
hold all
hot=hot+1;
else
%plot(X(:, i),Y(:, i),'b');%the rest of them
%hold all
cold=cold+1;
end
end
I am also using and calling a Laplace distribution generator made by Elvis Chen, which can be found here:
function y = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
% with mean mu and standard deviation sigma.
% mu : mean
% sigma : standard deviation
% [m, n] : the dimension of y.
% Default mu = 0, sigma = 1.
% For more information, refer to
% http://en.wikipedia.org./wiki/Laplace_distribution
% Author : Elvis Chen (bee33#sjtu.edu.cn)
% Date : 01/19/07
%Check inputs
if nargin < 2
error('At least two inputs are required');
end
if nargin == 2
mu = 0; sigma = 1;
end
if nargin == 3
sigma = 1;
end
% Generate Laplacian noise
u = rand(m, n)-0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u).* log(1- 2* abs(u));
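For reference, laprnd is the standard inverse-CDF sampler for the Laplace distribution; with U uniform on (-1/2, 1/2) and scale b = sigma/sqrt(2) (so the standard deviation comes out as sigma), the last line implements

Y = \mu - b\,\operatorname{sgn}(U)\,\ln\!\left(1 - 2\lvert U\rvert\right) \sim \mathrm{Laplace}(\mu, b),
\qquad
\operatorname{Var}(Y) = 2b^2 = \sigma^2.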
The result plot is
As you indicate, your problem is two-fold. On the one hand, you have memory issues because you need to do so many trials. On the other hand, you have performance issues, because you have to process all those trials.
Solutions to each issue often have a negative impact on the other issue. IMHO, the best approach would be to find a compromise.
More trials are only possible if you get rid of those gargantuan arrays that are required for vectorization, and use a different strategy for the loop. I will give priority to the possibility of using more trials, possibly at the cost of optimal performance.
When I execute your code as-is in the Matlab profiler, it immediately shows that the initial memory allocation for all your variables takes a lot of time. It also shows that the plot and hold all commands are the most time-consuming lines of them all. Some more trial-and-error shows that there is a disappointingly low maximum value for the trials you can do before OUT OF MEMORY errors start appearing.
The loop can be accelerated tremendously if you know a few things about its limitations in Matlab. In older versions of Matlab, it used to be true that loops should be avoided completely in favor of 'vectorized' code. In recent versions (I believe R2008a and up), the Mathworks introduced a piece of technology called the JIT accelerator (Just-in-Time compiler) which translates M-code into machine language on the fly during execution. Simply put, the JIT accelerator allows your code to bypass Matlab's interpreter and talk much more directly with the underlying hardware, which can save a lot of time.
The advice you'll hear a lot, that loops should be avoided in Matlab, is no longer generally true. While vectorization still has its value, any procedure of sizable complexity that is implemented using only vectorized code is often illegible, hard to understand, hard to change and hard to maintain. An implementation of the same procedure that uses loops often has none of these drawbacks and, moreover, will quite often be faster and require less memory.
Unfortunately, the JIT accelerator has a few nasty (and IMHO, unnecessary) limitations that you'll have to learn about.
One such thing is plot; it's generally a better idea to let a loop do nothing other than collect and manipulate data, and delay any plotting commands etc. until after the loop.
Another such thing is hold; the hold function is not a Matlab built-in function, meaning, it is implemented in M-language. Matlab's JIT accelerator is not able to accelerate non-builtin functions when used in a loop, meaning, your entire loop will run at Matlab's interpretation speed, rather than machine-language speed! Therefore, also delay this command until after the loop :)
Now, in case you're wondering, this last step can make a HUGE difference -- I know of one case where copy-pasting a function body into the upper-level loop caused a 1200x performance improvement (days of execution time were reduced to minutes!).
There is actually another minor issue in your loop (which is really small, and rather inconvenient, I will immediately agree): the loop variable should not be named i. The name i refers to the imaginary unit in Matlab, and resolving it unnecessarily consumes time on each iteration. It's small, but non-negligible.
Now, considering all this, I've come to the following implementation:
function [hot, cold, h] = MuonTracks(tracks)
% NOTE: no variables larger than 1x1 are initialized
width = 1e-4;
height = 2e-4;
% constant used for Laplacian noise distribution
bL = 15 / sqrt(2);
% Loop through all tracks
X = [];
hot = 0;
ii = 0;
while ii < tracks
ii = ii + 1;
% Note that I've inlined (== copy-pasted) the original laprnd()
% function call. This was necessary to work around limitations
% in loops in Matlab, and prevent the necessity of those HUGE
% variables.
%
% Of course, you can still easily generalize all of this:
% the new data
u = rand-0.5;
Ystart = 15;
Xstart = 800*rand-400;
Xend = Xstart - bL*sign(u)*log(1-2*abs(u));
b = (Ystart*Xend)/(Xend-Xstart);
% the test
if ((b < height && b > 0)) ||...
(Xend < width/2 && Xend > -width/2)
hot = hot+1;
% growing an array is perfectly fine when the chances of it
% happening are so slim
X = [X [Xstart; Xend]]; %#ok
end
end
% This is trivial to do here, and prevents an 'else' in the loop
cold = tracks - hot;
% Now plot the chosen ones
h = figure;
hold all
Y = repmat([15;0], 1, size(X,2));
plot(X, Y, 'r');
end
With this implementation, I can do this:
>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.
with a completely negligible memory footprint.
The profiler now also shows a nice and even distribution of effort along the code; no lines that really stand out because of their memory use or performance.
It's possibly not the fastest possible implementation (if anyone sees obvious improvements, please, feel free to edit them in). But, if you're willing to wait, you'll be able to do MuonTracks(1e23) (or higher :)
I've also done an implementation in C, which can be compiled into a Matlab MEX file:
/* DoMuonCounting.c */
#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
#include <stdlib.h>
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout);
/* simple little helper functions */
double sign(double x) { return (x>0)-(x<0); }
double rand_double() { return (double)rand()/(double)RAND_MAX; }
/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
mwSize
dims[] = {1,1};
const mxArray
/* Output arguments */
*hot_out = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*X_out = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);
const unsigned long long
tracks = (const unsigned long long)mxGetPr(prhs[0])[0];
unsigned long long
*hot = (unsigned long long*)mxGetPr(hot_out),
*cold = (unsigned long long*)mxGetPr(cold_out);
double
*Xout = mxGetPr(X_out);
/* call the actual function, and return */
CountMuons(tracks, hot,cold, Xout);
}
// The actual muon counting
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout)
{
const double
width = 1.0e-4,
height = 2.0e-4,
bL = 15.0/sqrt(2.0),
Ystart = 15.0;
double
Xstart,
Xend,
u,
b;
unsigned long long
i = 0ul;
*hot = 0ul;
*cold = tracks;
/* seed the RNG */
srand((unsigned)time(NULL));
/* aaaand start! */
while (i++ < tracks)
{
u = rand_double() - 0.5;
Xstart = 800.0*rand_double() - 400.0;
Xend = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));
b = (Ystart*Xend)/(Xend-Xstart);
if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
{
Xout[0 + *hot*2] = Xstart;
Xout[1 + *hot*2] = Xend;
++(*hot);
--(*cold);
}
}
}
compile in Matlab with
mex DoMuonCounting.c
(after having run mex -setup :) and then use it in conjunction with a small M-wrapper like this:
function [hot,cold, h] = MuonTrack2(tracks)
% call the MEX function
[hot,cold, Xtmp] = DoMuonCounting(tracks);
% process outputs, and generate plots
hot = uint32(hot); % circumvents limitations in 32-bit matlab
X = Xtmp(:,1:hot);
clear Xtmp
h = NaN;
if ~isempty(X)
h = figure;
hold all
Y = repmat([15;0], 1, hot);
plot(X, Y, 'r');
end
end
which allows me to do
>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.
Note that the memory footprint of the MEX version is slightly larger, but I think that's nothing to worry about.
The only flaw I see is the fixed maximum number of muon counts (hard-coded as 10000 as the initial array size of Xout; needed because there are no dynamically growing arrays in standard C). If you're worried this limit could be exceeded, simply increase it, change it to be equal to a fraction of tracks, or do some smarter (but more painful) dynamic array-growing tricks.
In Matlab, it is sometimes faster to vectorize rather than use a for loop. For example, this expression:
(Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0)
which is defined for each value of i, can be rewritten in a vectorised manner like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0)
Expressions like Xend(:,1) will give you a column vector, so Xend(:,1) < width/2 will give you a column vector of boolean values. Note that I have used & rather than && - this is because & performs an element-wise logical AND, unlike &&, which only works on scalar values. In this way you can build up the entire expression, such that the variable isChosen holds a column vector of boolean values, one for each row of your Xend/b vectors.
Getting counts is now as simple as this:
hot = sum(isChosen);
since true is represented by 1. And:
cold = sum(~isChosen);
Finally, you can get the data points by using the boolean vector to select rows:
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b'); % Plot unchosen values
EDIT: The code should look like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0);
hot = sum(isChosen);
cold = sum(~isChosen);
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values

NIntegrate evaluates a clearly non-zero integral to be zero. The integrand is a set of data points morphed into a distribution

I have a list of data points that are meant to represent a probability distribution, and I need to integrate over this distribution. However, since I didn't have a function, only a set of data points, I came up with the following code to represent the probability distribution:
dList1 = Import["Z-1.txt", "Table"];
dList2 = Import["Z_over-1.txt", "Table"];
dDist[X_,sym_] := (
dList = 0;
If[sym,
dList = dList1;
,
dList = dList2;
];
val = 0;
If[Abs[X] < Pi,
i = 2;
While[dList[[i]][[1]] < X, i++];
width = dList[[i]][[1]] - dList[[i-1]][[1]];
difX = dList[[i]][[1]] - X;
difY = dList[[i]][[2]] - dList[[i-1]][[2]];
val = dList[[i-1]][[2]] + (1-(difX/width)) difY;
];
Return[val];
);
where the set of data points are in the text files.
Performing the following command:
Plot[dDist[x, True], {x, -1, 1}]
gives this:
Whereas, performing this:
NIntegrate[dDist[x, True], {x, -1, 1}]
evaluates to zero, along with this warning:
I have tried increasing MinRecursion to no avail. I'm not sure what to do and would be open to any suggestions, including modifying the dDist function.
Not having your data to play with, I'd suggest using the table(s) to create Piecewise functions separated at the discontinuities. NIntegrate should handle those with no problem.
