Two float[] outputs in one kernel pass (Sobel -> Magnitude and Direction) - allocation

I wrote the following RenderScript code to calculate the gradient magnitude and direction in the same kernel as the Sobel gradients.
#pragma version(1)
#pragma rs java_package_name(
#pragma rs_fp_relaxed
rs_allocation bmpAllocIn, direction;
int32_t width;
int32_t height;
// Sobel, Magnitude und Direction
float __attribute__((kernel)) sobel_XY(uint32_t x, uint32_t y) {
    float sobX = 0, sobY = 0, magn = 0;
    // leave a border of 1 pixel
    if (x > 0 && y > 0 && x < (width - 1) && y < (height - 1)) {
        // 3x3 neighbourhood: first digit is the x offset, second digit the y offset
        uchar4 c11 = rsGetElementAt_uchar4(bmpAllocIn, x-1, y-1);
        uchar4 c12 = rsGetElementAt_uchar4(bmpAllocIn, x-1, y);
        uchar4 c13 = rsGetElementAt_uchar4(bmpAllocIn, x-1, y+1);
        uchar4 c21 = rsGetElementAt_uchar4(bmpAllocIn, x, y-1);
        uchar4 c23 = rsGetElementAt_uchar4(bmpAllocIn, x, y+1);
        uchar4 c31 = rsGetElementAt_uchar4(bmpAllocIn, x+1, y-1);
        uchar4 c32 = rsGetElementAt_uchar4(bmpAllocIn, x+1, y);
        uchar4 c33 = rsGetElementAt_uchar4(bmpAllocIn, x+1, y+1);
        sobX = (float) c11.r-c31.r + 2*(c12.r-c32.r) + c13.r-c33.r;
        sobY = (float) c11.r-c13.r + 2*(c21.r-c23.r) + c31.r-c33.r;
        float d = atan2(sobY, sobX);
        rsSetElementAt_float(direction, d, x, y);
        magn = hypot(sobX, sobY);
    } else {
        // border pixels: no gradient, so magnitude and direction stay 0
        rsSetElementAt_float(direction, 0, x, y);
    }
    return magn;
}
And the Java part:
float[] gm = new float[width*height]; // gradient magnitude
float[] gd = new float[width*height]; // gradient direction
ScriptC_sobel script;
script=new ScriptC_sobel(rs);
script.set_bmpAllocIn(Allocation.createFromBitmap(rs, bmpGray));
// dirAllocation: reference to the global variable "direction" in rs script. This
// dirAllocation is actually the second output of the kernel. It will be "filled" by
// the rsSetElementAt_float() method that include a reference to the current
// element (x,y) during the passage of the kernel.
Type.Builder TypeDir = new Type.Builder(rs, Element.F32(rs));
Allocation dirAllocation = Allocation.createTyped(rs, TypeDir.create());
// outAllocation: the kernel will slide along this global float Variable, which is
// "formally" the output (in principle the roles of the outAllocation (magnitude) and the
// second global variable direction (dirAllocation)could have been switched, the kernel
// just needs at least one in- or out-Allocation to "slide" along.)
Type.Builder TypeOut = new Type.Builder(rs, Element.F32(rs));
Allocation outAllocation = Allocation.createTyped(rs, TypeOut.create());
script.forEach_sobel_XY(outAllocation); //start kernel
// here comes the problem
outAllocation.copyTo(gm) ;
In a nutshell: this code works on my older Galaxy Tab2 (API 17), but it crashes (Fatal signal 7 (SIGBUS), code 2, fault addr 0x9e6d4000 in tid 6385) on my Galaxy S5 (API 21). The strange thing is that a simpler kernel that just calculates the SobelX or SobelY gradients in the very same way (except for the second allocation, used here for the direction) also works on the S5. Thus, the problem cannot be a general compatibility issue. Also, as I said, the kernel itself runs without problems (I can log the magnitude and direction values), but it fails at the .copyTo statements above. As you can see, the gm and gd float arrays have the same dimensions (width*height) as all other allocations used by the kernel. Any idea what the problem could be? Or is there an alternative, more robust way to do the whole thing?


How to make this pattern to expand and shrink back

I have a task to make a pattern of circles and squares as shown in the photo, and I need to animate it so that all objects smoothly increase to four times the size and then shrink back to their original size and this is repeated. I tried, but I can't understand the problem.
void pattern(int a, int b)
{
  boolean isShrinking = false;
  for (int x = 0; x <= width; x += a) {
    for (int y = 0; y <= height; y += a) {
      if (isShrinking) { a -= b; }
      else { a += b; }
      if (a == 50 || a == 200) {
        isShrinking = !isShrinking;
      }
    }
  }
}
void draw()
This is what the pattern needs to look like: [image]
Great that you've posted your attempt.
From what you presented I can't understand the problem either. If this is an assignment, perhaps try to get more clarification?
If you comment out the isShrinking part of the code, you do indeed get a drawing similar to the image you posted.
animate it so that all objects smoothly increase to four times the size and then shrink back to their original size and this is repeated
Does that simply mean scaling the whole pattern?
If so, you can make use of the sine function (sin()) and the map() function to achieve that:
sin(), as the reference mentions, returns a value between -1 and 1 when you pass it an angle between 0 and 2 * PI (because in Processing the trig functions use radians, not degrees, for angles).
You can use frameCount multiplied by a fractional value to mimic an evenly increasing angle. (Even if you go around the circle multiple times (angle > 2 * PI), sin() will still return a value between -1 and 1.)
map() takes a single value from one number range and maps it to another. (In your case, from sin()'s result range (-1, 1) to the scale range (1, 4).)
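To make the arithmetic behind these notes concrete, here is a minimal sketch in plain C++ rather than Processing. The names mapRange and scaleForFrame are hypothetical helpers invented for this illustration; mapRange only assumes that map() performs a plain linear remap, and it is not Processing's own implementation.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: linear remap of value from [inMin, inMax] to
// [outMin, outMax], assumed to behave like Processing's map().
float mapRange(float value, float inMin, float inMax, float outMin, float outMax) {
    assert(inMax != inMin);  // avoid dividing by zero
    return outMin + (outMax - outMin) * ((value - inMin) / (inMax - inMin));
}

// Hypothetical helper: scale factor for a given frame number.
// The frame number times a small step is treated as an angle in radians.
float scaleForFrame(int frameCount, float speed = 0.01f) {
    float s = std::sin(frameCount * speed);          // always within [-1, 1]
    return mapRange(s, -1.0f, 1.0f, 1.0f, 4.0f);     // remapped to [1, 4]
}
```

With these assumptions, a sine value of -1 gives a scale of 1, a sine value of 1 gives a scale of 4, and frame 0 (sine of 0) starts halfway at 2.5.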
Here's a tweaked version of your code with the above notes:
void setup()
{
  size(500, 500, FX2D);
}
void pattern(int a)
{
  for (int x = 0; x <= width; x += a) {
    for (int y = 0; y <= height; y += a) {
      ellipse(x, y, a, a);
      rect(x, y, a, a);
      ellipse(x+25, y+25, a/2, a/2);
    }
  }
}
void draw()
{
  // clear frame (previous drawings)
  background(255);
  // use the frame number as if it's an angle
  float angleInRadians = frameCount * .01;
  // map the sin of the frame based angle to the scale range
  float sinAsScale = map(sin(angleInRadians), -1, 1, 1, 4);
  // apply the scale
  scale(sinAsScale);
  // render the pattern (at current scale)
  pattern(50);
}
(I've chosen the FX2D renderer because it's smoother in this case.
Additionally, I advise formatting your code in the future. It makes it so much easier to read and it barely takes any effort (press Ctrl+T). In the long run you'll read code more than you'll write it, especially in large programs, and having code that's easy to read will save you plenty of time and potential headaches.)

Confusion about zFar and zNear plane offsets using glm::perspective

I have been using glm to help build a software rasterizer for self-education. In my camera class I am using glm::lookAt() to create my view matrix and glm::perspective() to create my perspective matrix.
I seem to be getting what I expect for my left, right, top and bottom clipping planes. However, I seem to be either doing something wrong for my near/far planes or there is an error in my understanding. I have reached a point at which my "google-fu" has failed me.
Operating under the assumption that I am correctly extracting clip planes from my glm::perspective matrix, and using the general plane equation:
aX+bY+cZ+d = 0
I am getting strange d or "offset" values for my zNear and zFar planes.
It is my understanding that the d value is the amount by which I would be shifting/translating the point P0 of a plane along the normal vector.
They are 0.200200200 and -0.200200200 respectively. However, my normals are correctly oriented at +1.0f and -1.0f along the z-axis, as expected for a plane perpendicular to my z basis vector.
So when testing a point such as (0, 0, -5) in world space against these planes, it is transformed by my view matrix to:
(0, 0, 5.81181192)
so when testing it against these planes in a clip chain, this example vertex would be culled.
Here is the start of a camera class establishing the relevant matrices:
static constexpr glm::vec3 UPvec(0.f, 1.f, 0.f);
static constexpr auto zFar = 100.f;
static constexpr auto zNear = 0.1f;
Camera::Camera(glm::vec3 eye, glm::vec3 center, float fovY, float w, float h) :
viewMatrix{ glm::lookAt(eye, center, UPvec) },
perspectiveMatrix{ glm::perspective(glm::radians<float>(fovY), w/h, zNear, zFar) },
frustumLeftPlane {setPlane(0, 1)},
frustumRighPlane {setPlane(0, 0)},
frustumBottomPlane {setPlane(1, 1)},
frustumTopPlane {setPlane(1, 0)},
frstumNearPlane {setPlane(2, 0)},
frustumFarPlane {setPlane(2, 1)},
The frustum objects are based off the following struct:
struct Plane
{
    glm::vec4 normal;
    float offset;
};
I have extracted the 6 clipping planes from the perspective matrix as below:
Plane Camera::setPlane(const int& row, const bool& sign)
{
    float temp[4]{};
    Plane plane{};
    if (sign == 0)
    {
        for (int i = 0; i < 4; ++i)
        {
            temp[i] = perspectiveMatrix[i][3] + perspectiveMatrix[i][row];
        }
    }
    else
    {
        for (int i = 0; i < 4; ++i)
        {
            temp[i] = perspectiveMatrix[i][3] - perspectiveMatrix[i][row];
        }
    }
    plane.normal.x = temp[0];
    plane.normal.y = temp[1];
    plane.normal.z = temp[2];
    plane.normal.w = 0.f;
    plane.offset = temp[3];
    plane.normal = glm::normalize(plane.normal);
    return plane;
}
Any help would be appreciated, as now I am at a loss.
Many thanks.
The d parameter of a plane equation describes how much the plane is offset from the origin along the plane normal. This also takes into account the length of the normal.
One can't just normalize the normal without also adjusting the d parameter since normalizing changes the length of the normal. If you want to normalize a plane equation then you also have to apply the division step to the d coordinate:
float normalLength = sqrt(temp[0] * temp[0] + temp[1] * temp[1] + temp[2] * temp[2]);
plane.normal.x = temp[0] / normalLength;
plane.normal.y = temp[1] / normalLength;
plane.normal.z = temp[2] / normalLength;
plane.normal.w = 0.f;
plane.offset = temp[3] / normalLength;
Side note 1: Usually, one would store the offset of a plane equation in the w-coordinate of a vec4 instead of a separate variable. The reason is that the typical operation you perform with it is a point to plane distance check like dist = n * x - d (for a given point x, normal n, offset d, * is dot product), which can then be written as dist = [n, d] * [x, -1].
Side note 2: Most software rasterizers, and hardware rasterizers as well, perform clipping after the projection step since it's cheaper and easier to implement.
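As a small self-contained illustration of the normalization step (a sketch only, written without glm so it compiles on its own), the following stores all four plane coefficients together and uses the question's convention aX+bY+cZ+d = 0, so the signed distance is n * x + d; note that this is the opposite sign of d compared to the n * x - d form in side note 1. The names Plane (a different layout from the question's struct), normalizePlane and signedDistance are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical plane type: coefficients of a*x + b*y + c*z + d = 0.
struct Plane {
    float a, b, c, d;
};

// Divide all four coefficients by the length of the normal (a, b, c).
// Dividing only the normal and leaving d alone would describe a different plane.
Plane normalizePlane(const Plane& p) {
    const float len = std::sqrt(p.a * p.a + p.b * p.b + p.c * p.c);
    assert(len > 0.0f);  // a plane needs a non-zero normal
    return { p.a / len, p.b / len, p.c / len, p.d / len };
}

// Plane equation evaluated at a point. For a normalized plane this is the
// signed distance; for an unnormalized plane it is scaled by the normal length.
float signedDistance(const Plane& p, float x, float y, float z) {
    return p.a * x + p.b * y + p.c * z + p.d;
}
```

For example, the raw plane (0, 0, 2, 4) describes z = -2. After normalization it becomes (0, 0, 1, 2), and the point (0, 0, 3) evaluates to 5, which is its actual distance from the plane; with the raw coefficients the same point evaluates to 10.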

Processing: Efficiently create uniform grid

I'm trying to create a grid of an image (in the way one would tile a background with). Here's what I've been using:
PImage bgtile;
PGraphics bg;
int tilesize = 50;
void setup() {
  int t = millis();
  bgtile = loadImage("bgtile.png");
  int bgw = ceil( ((float) width) / tilesize) + 1;
  int bgh = ceil( ((float) height) / tilesize) + 1;
  bg = createGraphics(bgw*tilesize,bgh*tilesize);
  for(int i = 0; i < bgw; i++){
    for(int j = 0; j < bgh; j++){
      bg.image(bgtile, i*tilesize, j*tilesize, tilesize, tilesize);
    }
  }
  print(millis() - t);
}
The timing code says that this takes about a quarter of a second, but by my count there's a full second once the window opens before anything shows up on screen (which should happen as soon as draw is first run). Is there a faster way to get this same effect? (I want to avoid rendering bgtile hundreds of times in the draw loop for obvious reasons)
One way could be to make use of the GPU and let OpenGL repeat a texture for you.
Processing makes it fairly easy to repeat a texture via textureWrap(REPEAT)
Instead of drawing an image you'd make your own quad shape and instead of calling vertex(x, y) for example, you'd call vertex(x, y, u, v); passing texture coordinates (more low level info on the OpenGL link above). The simple idea is x,y would control the geometry on screen and u,v would control how the texture is applied to the geometry.
Another thing you can control is textureMode(), which allows you to control how you specify the texture coordinates (U, V):
IMAGE mode is the default: you use pixel coordinates (based on the dimensions of the texture)
NORMAL mode uses values between 0.0 and 1.0 (also known as normalised values) where 1.0 means the maximum the texture can go (e.g. image width for U or image height for V) and you don't need to worry about knowing the texture image dimensions
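To illustrate what repeating means for normalised coordinates, here is a conceptual sketch in plain C++. It is not Processing's or OpenGL's actual code; wrapRepeat and texelIndex are hypothetical helper names, and the sketch simply assumes that a repeating texture keeps only the fractional part of U (or V).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: keep only the fractional part of a normalised
// coordinate, so 0.25, 1.25 and 2.25 all sample the same place.
float wrapRepeat(float u) {
    return u - std::floor(u);
}

// Hypothetical helper: convert a normalised coordinate to a texel index
// for a texture that is textureSize texels wide (or tall).
int texelIndex(float u, int textureSize) {
    assert(textureSize > 0);
    int i = static_cast<int>(wrapRepeat(u) * textureSize);
    // Guard against the fractional part rounding up to exactly 1.0.
    return i < textureSize ? i : textureSize - 1;
}
```

Under that assumption, a quad whose U coordinate runs from 0 to 3 shows the texture three times across, because U values 0.5, 1.5 and 2.5 all land on the same texel.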
Here's a basic example based on the textureMode() example above:
PImage img;
void setup() {
  // P2D (OpenGL) renderer; the window size used here is an assumption
  size(400, 400, P2D);
  img = loadImage("");
  // texture mode can be IMAGE (pixel dimensions) or NORMAL (0.0 to 1.0)
  // normal means 1.0 is full width (for U) or height (for V) without having to know the image resolution
  textureMode(NORMAL);
  // this is what will handle tiling for you
  textureWrap(REPEAT);
}
void draw() {
  // drag mouse on X axis to change tiling
  int tileRepeats = (int)map(constrain(mouseX,0,width), 0, width, 1, 100);
  // draw a textured quad
  beginShape(QUADS);
  // set the texture
  texture(img);
  // x , y , U , V
  vertex(0 , 0 , 0 , 0);
  vertex(width, 0 , tileRepeats, 0);
  vertex(width, height, tileRepeats, tileRepeats);
  vertex(0 , height, 0 , tileRepeats);
  endShape();
}
Drag the mouse on the X axis to control the number of repetitions.
In this simple example both vertex coordinates and texture coordinates are going clockwise (top left, top right, bottom right, bottom left order).
There are probably other ways to achieve the same result: using a PShader comes to mind.
Your approach caching the tiles in setup is ok.
Even flattening your nested loop into a single loop at best may only shave a few milliseconds off, but nothing substantial.
If you tried to cache my snippet above it would make a minimal difference.
In this particular case, because of the back and forth between Java/OpenGL (via JOGL), as far as I can tell using VisualVM, it looks like there's not a lot of room for improvement since simply swapping buffers takes so long (e.g. bg.image()):
An easy way to do this would be to use Processing's built-in get() function, which returns a PImage of the region you pass. For example, PImage pic = get(0, 0, width, height); will capture a "screenshot" of your entire window. So, you can create the image like you already do, and then take a screenshot and display that screenshot.
PImage bgtile;
PGraphics bg;
PImage screenGrab;
int tilesize = 50;
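void setup() {
  bgtile = loadImage("bgtile.png");
  int bgw = ceil(((float) width) / tilesize) + 1;
  int bgh = ceil(((float) height) / tilesize) + 1;
  bg = createGraphics(bgw * tilesize, bgh * tilesize);
  // drawing into a PGraphics has to happen between beginDraw() and endDraw()
  bg.beginDraw();
  for (int i = 0; i < bgw; i++) {
    for (int j = 0; j < bgh; j++) {
      bg.image(bgtile, i * tilesize, j * tilesize, tilesize, tilesize);
    }
  }
  bg.endDraw();
  // draw the tiled buffer to the window once so that get() has something to capture
  image(bg, 0, 0);
  screenGrab = get(0, 0, width, height);
}
void draw() {
  image(screenGrab, 0, 0);
}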
This will still take a little bit to generate the image, but once it does, there is no need to use the for loops again unless you change the tilesize.
@George Profenza's answer looks more efficient than my solution, but mine may take a little less modification to the code you already have.

Calculating and storing pixelated ellipse

I was wondering if it is possible to create a function (in any language) that takes a width and a height as input.
This function would then calculate the biggest ellipse that fits inside the dimensions it is given, and store it in a matrix such as in these two examples:
In the left example, the width is 14 and height is 27, where the white part is the ellipse.
In the right example, the width is 38 and height is 21, where, once again, the white part is the ellipse.
Of course, the black and white parts can be seen as true/false values if they are part of the ellipse or not.
Yes, it is possible. The process is called ellipse rasterization. Here are a few methods to do so:
Let our image have resolution xs,ys, so the center (x0,y0) and the semi-axes a,b are:
x0=xs/2
y0=ys/2
a =x0-1
b =y0-1
using ellipse equation
(x-x0)^2/a^2 + (y-y0)^2/b^2 <= 1
so 2 nested for loops + an if condition deciding whether you are inside or outside the ellipse.
for (y=0;y<ys;y++)
for (x=0;x<xs;x++)
if (((x-x0)*(x-x0)/(a*a))+((y-y0)*(y-y0)/(b*b))<=1.0) pixel[y][x]=color_inside;
else pixel[y][x]=color_outside;
You can optimize this quite a lot by recomputing the parts of the equation only when they change, so some are computed just once, others on each y iteration, and the rest on each x iteration. It is also better to multiply instead of dividing.
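A minimal C++ sketch of this first method with those optimizations applied might look as follows. It multiplies the inequality through by a*a*b*b so that only integer multiplications remain, and it computes the y-dependent term once per row. The function name filledEllipseMask and the vector-of-rows matrix layout are assumptions made for this example, as is the choice of x0 = xs/2, y0 = ys/2 for the center.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns mask[y][x] == true for pixels inside the
// ellipse that fits a width x height image with a one-pixel margin.
std::vector<std::vector<bool>> filledEllipseMask(int width, int height) {
    assert(width >= 4 && height >= 4);  // keeps both semi-axes at least 1
    const long long x0 = width / 2, y0 = height / 2;
    const long long a = x0 - 1, b = y0 - 1;
    // (dx^2 / a^2) + (dy^2 / b^2) <= 1, multiplied through by a^2 * b^2:
    //   b^2 * dx^2 + a^2 * dy^2 <= a^2 * b^2
    const long long aa = a * a, bb = b * b, limit = aa * bb;

    std::vector<std::vector<bool>> mask(height, std::vector<bool>(width, false));
    for (int y = 0; y < height; ++y) {
        const long long dy = y - y0;
        const long long yTerm = aa * dy * dy;  // computed once per row
        for (int x = 0; x < width; ++x) {
            const long long dx = x - x0;
            mask[y][x] = (bb * dx * dx + yTerm <= limit);
        }
    }
    return mask;
}
```

For the 14 by 27 example from the question this gives a center of (7, 13) with semi-axes 6 and 12, so row 13 is filled from x = 1 to x = 13 and column 7 from y = 1 to y = 25.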
using parametric ellipse equation
x(t) = x0 + a*cos(t)
y(t) = y0 + b*sin(t)
t = <0,2.0*M_PI> // for whole ellipse
so one for loop creates the coordinates of one quadrant, and you fill lines inside and outside for it and its 3 mirrored copies, using only horizontal or only vertical lines. However, this approach needs a buffer to store the circumference points of one quadrant.
Using Bresenham ellipse algorithm
Using any Circle algorithm and stretch to ellipse
so simply use a square area whose size is the lesser of the resolutions xs,ys, render a circle, and then stretch it back to xs,ys. If you do not stretch during rasterization then you might create artifacts. In such a case it is better to use the bigger resolution and stretch down, but that is slower of course.
Drawing an ellipse and storing it in a matrix can be accomplished with two different methods: either Rasterization (the recommended way) or pixel-by-pixel rendering. According to @Spektre's comment, I wonder if both methods are called "rasterization" since they both render the ellipse to raster image. Anyway, I'll explain how to use both methods in C++ to draw an ellipse and store it in your matrix.
Note: Here I'll assume that the origin of your matrix matrix[0][0] refers to the upper-left corner of the image. So points on the matrix are described by x- and y-coordinate pairs, such that x-coordinates increase to the right; y-coordinates increase from top to bottom.
Pixel-by-pixel ellipse rendering
With this method, you loop over all the pixels in your matrix to determine whether each pixel is inside or outside of the ellipse. If the pixel is inside, you make it white, otherwise, you make it black.
In the following example code, the isPointOnEllipse function determines the status of a point relative to the ellipse. It takes the coordinates of the point, coordinates of the center of the ellipse, and the lengths of semi-major and semi-minor axes as parameters. It then returns either one of the values PS_OUTSIDE, PS_ONPERIM, or PS_INSIDE, which indicate that the point lies outside of the ellipse, the point lies exactly on the ellipse's perimeter, or the point lies inside of the ellipse, respectively.
Obviously, if the point status is PS_ONPERIM, then the point is also part of the ellipse and must be made white; because the ellipse's outline must be colored in addition to its inner area.
You must call ellipseInMatrixPBP function to draw an ellipse, passing it a pointer to your matrix, and the width and height of your matrix. This function loops through every pixel in your matrix, and then calls isPointOnEllipse for every pixel to see if it is inside or outside of the ellipse. Finally, it modifies the pixel accordingly.
#include <math.h>
// Indicates the point lies outside of the ellipse.
#define PS_OUTSIDE (0)
// Indicates the point lies exactly on the perimeter of the ellipse.
#define PS_ONPERIM (1)
// Indicates the point lies inside of the ellipse.
#define PS_INSIDE (2)
short isPointOnEllipse(int cx, int cy, int rx, int ry, int x, int y)
{
    // Scale the x offset so the ellipse becomes a circle of radius ry.
    double m = (x - cx) * ((double) ry) / ((double) rx);
    double n = y - cy;
    double h = sqrt(m * m + n * n);
    if (h == ry)
        return PS_ONPERIM;
    else if (h < ry)
        return PS_INSIDE;
    else
        return PS_OUTSIDE;
}
void ellipseInMatrixPBP(bool **matrix, int width, int height)
{
    // So the ellipse shall be stretched to the whole matrix
    // with a one-pixel margin.
    int cx = width / 2;
    int cy = height / 2;
    int rx = cx - 1;
    int ry = cy - 1;
    int x, y;
    short pointStatus;
    // Loop through all the pixels in the matrix.
    for (x = 0;x < width;x++)
    {
        for (y = 0;y < height;y++)
        {
            pointStatus = isPointOnEllipse(cx, cy, rx, ry, x, y);
            // If the current pixel is outside of the ellipse,
            // make it black (false).
            // Else if the pixel is inside of the ellipse or on its perimeter,
            // make it white (true).
            if (pointStatus == PS_OUTSIDE)
                matrix[x][y] = false;
            else
                matrix[x][y] = true;
        }
    }
}
Ellipse rasterization
If the pixel-by-pixel approach to rendering is too slow, then use the rasterization method. Here you determine which pixels in the matrix the ellipse affects, and then you modify those pixels (e.g. you turn them white). Unlike pixel-by-pixel rendering, rasterization does not have to pass through the pixels that are outside of the ellipse shape, which is why this approach is so much faster.
To rasterize the ellipse, it is recommended that you use the so-called Mid-point Ellipse algorithm, which is an extended form of Bresenham's circle algorithm.
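For reference, a compact sketch of the textbook midpoint ellipse algorithm is shown below. This is my own hedged rendering of the commonly taught two-region formulation, not code from any particular library; the names Grid, plot4 and midpointEllipse are made up for the example, the grid is indexed as grid[y][x], and the outline is sized with the same one-pixel margin used elsewhere in this answer. One known weak spot of this formulation is that very thin ellipses may not be closed at their extreme points.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<bool>>;  // hypothetical layout: grid[y][x]

// Set the four pixels that mirror (x, y) around the center (cx, cy).
static void plot4(Grid& g, int cx, int cy, int x, int y) {
    g[cy + y][cx + x] = true;
    g[cy + y][cx - x] = true;
    g[cy - y][cx + x] = true;
    g[cy - y][cx - x] = true;
}

// Outline of the ellipse fitting a width x height grid with a one-pixel margin.
Grid midpointEllipse(int width, int height) {
    assert(width >= 4 && height >= 4);  // keeps both radii at least 1
    const int cx = width / 2, cy = height / 2;
    const int rx = cx - 1, ry = cy - 1;
    Grid g(height, std::vector<bool>(width, false));

    const double rx2 = double(rx) * rx, ry2 = double(ry) * ry;
    int x = 0, y = ry;
    double px = 0.0, py = 2.0 * rx2 * y;  // 2*ry^2*x and 2*rx^2*y
    plot4(g, cx, cy, x, y);

    // Region 1: the slope is shallower than -1, so step in x each iteration.
    double p = ry2 - rx2 * ry + 0.25 * rx2;
    while (px < py) {
        ++x;
        px += 2.0 * ry2;
        if (p < 0) {
            p += ry2 + px;
        } else {
            --y;
            py -= 2.0 * rx2;
            p += ry2 + px - py;
        }
        plot4(g, cx, cy, x, y);
    }

    // Region 2: the slope is steeper than -1, so step in y each iteration.
    p = ry2 * (x + 0.5) * (x + 0.5) + rx2 * (y - 1.0) * (y - 1.0) - rx2 * ry2;
    while (y > 0) {
        --y;
        py -= 2.0 * rx2;
        if (p > 0) {
            p += rx2 - py;
        } else {
            ++x;
            px += 2.0 * ry2;
            p += rx2 - py + px;
        }
        plot4(g, cx, cy, x, y);
    }
    return g;
}
```

Only one quadrant is actually computed; the other three come from mirroring, which is where most of the speed advantage over testing every pixel comes from.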
However, I've discovered an ellipse-drawing algorithm which is probably sophisticated enough (except for its performance) to compete with Bresenham's! So I'll post the function that you want - written in C++.
The following code defines a function named ellipseInMatrix that draws an ellipse with a one-pixel stroke, but does not fill that ellipse. You need to pass this function a pointer to the matrix that you have already allocated and initialized to false values, plus the dimensions of the matrix as integers. Note that ellipseInMatrix internally calls the rasterizeEllipse function which performs the main rasterizing operation. Whenever this function finds a point of the ellipse, it sets the corresponding pixel in the matrix to true, which causes the pixel to turn white.
#include <math.h>
#define pi (2 * acos(0.0))
#define coord_nil (-1)
struct point
{
    int x;
    int y;
};
double getEllipsePerimeter(int rx, int ry)
{
    return pi * sqrt(2 * (rx * rx + ry * ry));
}
void getPointOnEllipse(int cx, int cy, int rx, int ry, double d, struct point *pp)
{
    double theta = d * sqrt(2.0 / (rx * rx + ry * ry));
    // double theta = 2 * pi * d / getEllipsePerimeter(rx, ry);
    pp->x = (int) floor(cx + cos(theta) * rx);
    pp->y = (int) floor(cy - sin(theta) * ry);
}
void rasterizeEllipse(bool **matrix, int cx, int cy, int rx, int ry)
{
    struct point currentPoint, midPoint;
    struct point previousPoint = {coord_nil, coord_nil};
    double perimeter = floor(getEllipsePerimeter(rx, ry));
    double i;
    // Loop over the perimeter of the ellipse to determine all points on the ellipse path.
    for (i = 0.0;i < perimeter;i++)
    {
        // Find the current point and determine its coordinates.
        getPointOnEllipse(cx, cy, rx, ry, i, &currentPoint);
        // So color the current point.
        matrix[currentPoint.x][currentPoint.y] = true;
        // So check if the previous point exists. Please note that if the current
        // point is the first point (i = 0), then there will be no previous point.
        if (previousPoint.x != coord_nil)
        {
            // Now check if there is a gap between the current point and the previous
            // point. We know it's not OK to have gaps along the ellipse path!
            if (!((currentPoint.x - 1 <= previousPoint.x) && (previousPoint.x <= currentPoint.x + 1) &&
                (currentPoint.y - 1 <= previousPoint.y) && (previousPoint.y <= currentPoint.y + 1)))
            {
                // Find the missing point by defining its offset as a fraction
                // between the current point offset and the previous point offset.
                getPointOnEllipse(cx, cy, rx, ry, i - 0.5, &midPoint);
                matrix[midPoint.x][midPoint.y] = true;
            }
        }
        previousPoint.x = currentPoint.x;
        previousPoint.y = currentPoint.y;
    }
}
void ellipseInMatrix(bool **matrix, int width, int height)
{
    // So the ellipse shall be stretched to the whole matrix
    // with a one-pixel margin.
    int cx = width / 2;
    int cy = height / 2;
    int rx = cx - 1;
    int ry = cy - 1;
    // Call the general-purpose ellipse rasterizing function.
    rasterizeEllipse(matrix, cx, cy, rx, ry);
}
If you need to fill the ellipse with white pixels like the examples that you provided, you can use the following code instead to rasterize a filled ellipse. Call the filledEllipseInMatrix function with a similar syntax to the previous function.
#include <math.h>
#define pi (2 * acos(0.0))
#define coord_nil (-1)
struct point
{
    int x;
    int y;
};
double getEllipsePerimeter(int rx, int ry)
{
    return pi * sqrt(2 * (rx * rx + ry * ry));
}
void getPointOnEllipse(int cx, int cy, int rx, int ry, double d, struct point *pp)
{
    double theta = d * sqrt(2.0 / (rx * rx + ry * ry));
    // double theta = 2 * pi * d / getEllipsePerimeter(rx, ry);
    pp->x = (int) floor(cx + cos(theta) * rx);
    pp->y = (int) floor(cy - sin(theta) * ry);
}
void fillBar(struct point seed, bool **matrix, int cx)
{
    int bx;
    if (seed.x > cx)
    {
        for (bx = seed.x;bx >= cx;bx--)
            matrix[bx][seed.y] = true;
    }
    else
    {
        for (bx = seed.x;bx <= cx;bx++)
            matrix[bx][seed.y] = true;
    }
}
void rasterizeFilledEllipse(bool **matrix, int cx, int cy, int rx, int ry)
{
    struct point currentPoint, midPoint;
    struct point previousPoint = {coord_nil, coord_nil};
    double perimeter = floor(getEllipsePerimeter(rx, ry));
    double i;
    // Loop over the perimeter of the ellipse to determine all points on the ellipse path.
    for (i = 0.0;i < perimeter;i++)
    {
        // Find the current point and determine its coordinates.
        getPointOnEllipse(cx, cy, rx, ry, i, &currentPoint);
        // So fill the bar (horizontal line) that leads from
        // the current point to the minor axis.
        fillBar(currentPoint, matrix, cx);
        // So check if the previous point exists. Please note that if the current
        // point is the first point (i = 0), then there will be no previous point.
        if (previousPoint.x != coord_nil)
        {
            // Now check if there is a gap between the current point and the previous
            // point. We know it's not OK to have gaps along the ellipse path!
            if (!((currentPoint.x - 1 <= previousPoint.x) && (previousPoint.x <= currentPoint.x + 1) &&
                (currentPoint.y - 1 <= previousPoint.y) && (previousPoint.y <= currentPoint.y + 1)))
            {
                // Find the missing point by defining its offset as a fraction
                // between the current point offset and the previous point offset.
                getPointOnEllipse(cx, cy, rx, ry, i - 0.5, &midPoint);
                fillBar(midPoint, matrix, cx);
            }
        }
        previousPoint.x = currentPoint.x;
        previousPoint.y = currentPoint.y;
    }
}
void filledEllipseInMatrix(bool **matrix, int width, int height)
{
    // So the ellipse shall be stretched to the whole matrix
    // with a one-pixel margin.
    int cx = width / 2;
    int cy = height / 2;
    int rx = cx - 1;
    int ry = cy - 1;
    // Call the general-purpose ellipse rasterizing function.
    rasterizeFilledEllipse(matrix, cx, cy, rx, ry);
}

How to make performant AOT blur with variable kernel size?

What would be an effective single threaded scheduling for this type of code?
I'm trying to define a blur with a variable kernel size in AOT. I tried another solution, but I can't figure out a good way to schedule it that would get me the same performance as making the kernel size a GeneratorParam and precompiling with different values.
Here is a snippet with the GeneratorParam:
// GeneratorParam<int32_t> kernelOffset{"kernelOffset", 1};
int32_t kernelSize = 2*kernelOffset + 1;
{
    // vertical pass: sum kernelSize rows, then average
    Halide::Expr sum = input(x, y);
    for (int i=1;i<kernelSize;i++) {
        sum = sum + Halide::cast<uint16_t>(input(x, y+i));
    }
    blur_y(x, y) = sum/kernelSize;
}
{
    // horizontal pass over the vertically blurred result
    Halide::Expr sum = blur_y(x, y);
    for (int i=1;i<kernelSize;i++) {
        sum = sum + blur_y(x+i, y);
    }
    blur_x(x, y) = sum/kernelSize;
}
// And the schedule
blur_y.compute_at(blur_x, y);
output.vectorize(x, 16);
And using the other solution:
Halide::RDom box (0, kernelSize, "box");
blur_y(x, y) = Halide::undef<uint16_t>();
Halide::RDom ry (yMin+1, yMax-yMin, "ry");
blur_y(x, yMin) = Halide::cast<uint16_t>(0);
blur_y(x, yMin) += Halide::cast<uint16_t>(input(x, yMin+box))/kernelSize;
blur_y(x, ry) = blur_y(x, ry-1) + input_uint16(x, ry+kernelOffset-1)/kernelSize - input_uint16(x, ry-1-kernelOffset)/kernelSize;
blur_x(x, y) = Halide::undef<uint16_t>();
Halide::RDom rx (xMin+1, xMax-xMin, "rx");
blur_x(xMin, y) = Halide::cast<uint16_t>(0);
blur_x(xMin, y) += blur_y(xMin+box, y)/kernelSize;
blur_x(rx, y) = blur_x(rx-1, y) + blur_y(rx+kernelOffset, y)/kernelSize - blur_y(rx-1-kernelOffset, y)/kernelSize;
The only way to get the same speed between fixed and variable radius is to use the specialize scheduling directive to generate fixed code for specific radii. If you can JIT and are blurring lots of pixels at the same radii, it may be profitable to JIT a specific filter for a given radius.
Generally, really fast arbitrary-radius blurs use adaptive approaches in which large radii are handled by something like iterative box filtering, intermediate radii use separable convolution, and very small radii may use non-separable convolution. The blur is often done in multiple passes combining multiple approaches.
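To illustrate why box filtering suits large radii, here is a minimal sketch in plain C++ (not Halide) of a one-dimensional running-sum box filter, whose cost per pixel does not depend on the radius. The function name boxBlur1D and the clamp-to-edge border handling are assumptions made for this example; running such a pass several times, along rows and then along columns, is one common way to approximate a smoother blur.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: 1D box blur with window 2*radius+1 and clamped borders.
// A running sum is updated with one add and one subtract per output pixel.
std::vector<uint16_t> boxBlur1D(const std::vector<uint16_t>& in, int radius) {
    assert(radius >= 0);
    const int n = static_cast<int>(in.size());
    std::vector<uint16_t> out(in.size());
    if (n == 0) return out;

    const uint32_t window = static_cast<uint32_t>(2 * radius + 1);
    // Clamp-to-edge access: indices outside the array reuse the border value.
    auto at = [&](int i) -> uint32_t {
        return in[std::min(std::max(i, 0), n - 1)];
    };

    // Sum of the window centered on pixel 0.
    uint32_t sum = 0;
    for (int k = -radius; k <= radius; ++k) sum += at(k);

    for (int i = 0; i < n; ++i) {
        out[i] = static_cast<uint16_t>(sum / window);
        // Slide the window one pixel to the right.
        sum += at(i + radius + 1);
        sum -= at(i - radius);
    }
    return out;
}
```

With these assumptions, blurring {0, 0, 9, 0, 0} with radius 1 yields {0, 3, 3, 3, 0}, and a constant input stays constant for any radius.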
