How to make performant AOT blur with variable kernel size? - halide

What would be an effective single threaded scheduling for this type of code?
I'm trying to define a blur with a variable kernel size in AOT. I tried the solution from https://github.com/halide/Halide/issues/180, but I can't figure out a good way to schedule it that would give me the same performance as making the kernel size a GeneratorParam and precompiling with different values.
Here is a snippet with the GeneratorParam:
// GeneratorParam<int32_t> kernelOffset{"kernelOffset", 1};
int32_t kernelSize = 2*kernelOffset + 1;
{
Halide::Expr sum = input(x, y);
for (int i=1;i<kernelSize;i++) {
sum = sum + Halide::cast<uint16_t>(input(x, y+i));
}
blur_y(x, y) = sum/kernelSize;
}
{
Halide::Expr sum = blur_y(x, y);
for (int i=1;i<kernelSize;i++) {
sum = sum + blur_y(x+i, y);
}
blur_x(x, y) = sum/kernelSize;
}
...
// And the schedule
blur_x.compute_root();
blur_y.compute_at(blur_x, y);
output.vectorize(x, 16);
And using the solution from https://github.com/halide/Halide/issues/180:
Halide::RDom box (0, kernelSize, "box");
blur_y(x, y) = Halide::undef<uint16_t>();
{
Halide::RDom ry (yMin+1, yMax-yMin, "ry");
blur_y(x, yMin) = Halide::cast<uint16_t>(0);
blur_y(x, yMin) += Halide::cast<uint16_t>(input(x, yMin+box))/kernelSize;
blur_y(x, ry) = blur_y(x, ry-1) + input_uint16(x, ry+kernelOffset-1)/kernelSize - input_uint16(x, ry-1-kernelOffset)/kernelSize;
}
blur_x(x, y) = Halide::undef<uint16_t>();
{
Halide::RDom rx (xMin+1, xMax-xMin, "rx");
blur_x(xMin, y) = Halide::cast<uint16_t>(0);
blur_x(xMin, y) += blur_y(xMin+box, y)/kernelSize;
blur_x(rx, y) = blur_x(rx-1, y) + blur_y(rx+kernelOffset, y)/kernelSize - blur_y(rx-1-kernelOffset, y)/kernelSize;
}

The only way to get the same speed between fixed and variable radius is to use the specialize scheduling directive to generate fixed code for specific radii. If you can JIT and are blurring lots of pixels at the same radii, it may be profitable to JIT a specific filter for a given radius.
Generally, really fast arbitrary-radius blurs use adaptive approaches in which large radii are handled by something like iterative box filtering, intermediate radii use separable convolution, and very small radii may use non-separable convolution. The blur is often done in multiple passes that combine several of these approaches.
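To make the "iterative box filtering" idea concrete, here is a minimal sketch in plain Python (not Halide). It assumes a 1D row of samples and clamps at the edges; the function names are made up for illustration. The running-sum version does a constant amount of work per pixel regardless of the radius, which is why it suits large radii, and repeating it a few times gives a smoother, Gaussian-like result.

```python
def box_blur_1d(row, radius):
    """Box blur using a running sum; edges are clamped to the nearest sample."""
    n = len(row)
    size = 2 * radius + 1
    # Sum of the window centred on index 0, with out-of-range taps clamped.
    window = sum(row[min(max(i, 0), n - 1)] for i in range(-radius, radius + 1))
    out = []
    for x in range(n):
        out.append(window / size)
        # Slide the window: add the sample entering on the right,
        # drop the sample leaving on the left.
        entering = row[min(x + radius + 1, n - 1)]
        leaving = row[max(x - radius, 0)]
        window += entering - leaving
    return out


def box_blur_naive(row, radius):
    """Reference version whose cost grows with the radius."""
    n = len(row)
    size = 2 * radius + 1
    return [
        sum(row[min(max(x + i, 0), n - 1)] for i in range(-radius, radius + 1)) / size
        for x in range(n)
    ]


def repeated_box_blur_1d(row, radius, passes=3):
    """Apply the box blur several times (iterative box filtering)."""
    for _ in range(passes):
        row = box_blur_1d(row, radius)
    return row
```

For a 2D image the same pass would be run over rows and then over columns, mirroring the blur_y / blur_x split in the question.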

Related

How to make this pattern to expand and shrink back

I have a task to make a pattern of circles and squares as shown in the photo, and I need to animate it so that all objects smoothly grow to four times their size and then shrink back to their original size, and this repeats. I tried, but I can't understand the problem.
void setup()
{
size(500,500);
background(#A5A3A3);
noFill();
rectMode(CENTER);
ellipseMode(CENTER);
}
void pattern(int a, int b)
{
boolean isShrinking = false;
for(int x = 0; x <= width; x += a){
for(int y = 0; y <= height; y += a){
stroke(#1B08FF);
ellipse(x,y,a,a);
stroke(#FF0000);
rect(x,y,a,a);
stroke(#0BFF00);
ellipse(x+25,y+25,a/2,a/2);
if (isShrinking){a -= b;}
else {a += b;}
if (a == 50 || a == 200){
isShrinking = !isShrinking ; }
}
}
}
void draw()
{
pattern(50,1);
}
This is what the pattern needs to look like.
Great that you've posted your attempt.
From what you presented, I can't understand the problem either. If this is an assignment, perhaps try to get more clarification?
If you comment out the isShrinking part of the code, you do indeed get a drawing similar to the image you posted.
animate it so that all objects smoothly increase to four times the size and then shrink back to their original size and this is repeated
Does that simply mean scaling the whole pattern ?
If so, you can make use of the sine function (sin()) and the map() function to achieve that:
sin(), as the reference mentions, returns a value between -1 and 1 when you pass it an angle between 0 and 2 * PI (because in Processing trig. functions use radians not degrees for angles)
You can use frameCount multiplied by a small fractional value to mimic an ever-increasing angle. (Even if you go around the circle multiple times (angle > 2 * PI), sin() will still return a value between -1 and 1.)
map() takes a single value from one number range and maps it to another (in your case, from sin()'s result range (-1, 1) to the scale range (1, 4)).
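The arithmetic behind those notes can be checked outside Processing. This is a small Python sketch; remap is a hypothetical stand-in for Processing's map(), and the 0.01 speed factor is just an example value.

```python
import math


def remap(value, start1, stop1, start2, stop2):
    """Linear re-mapping from one range to another, analogous to map()."""
    return start2 + (stop2 - start2) * (value - start1) / (stop1 - start1)


def scale_for_frame(frame_count, speed=0.01):
    """Scale factor for a given frame: oscillates smoothly between 1 and 4."""
    return remap(math.sin(frame_count * speed), -1.0, 1.0, 1.0, 4.0)
```

A sine of -1 gives the original size (scale 1), a sine of 1 gives four times the size, and everything in between is interpolated linearly.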
Here's a tweaked version of your code with the above notes:
void setup()
{
size(500, 500, FX2D);
background(#A5A3A3);
noFill();
rectMode(CENTER);
ellipseMode(CENTER);
}
void pattern(int a)
{
for (int x = 0; x <= width; x += a) {
for (int y = 0; y <= height; y += a) {
stroke(#1B08FF);
ellipse(x, y, a, a);
stroke(#FF0000);
rect(x, y, a, a);
stroke(#0BFF00);
ellipse(x+25, y+25, a/2, a/2);
}
}
}
void draw()
{
// clear frame (previous drawings)
background(255);
// use the frame number as if it's an angle
float angleInRadians = frameCount * .01;
// map the sin of the frame based angle to the scale range
float sinAsScale = map(sin(angleInRadians), -1, 1, 1, 4);
// apply the scale
scale(sinAsScale);
// render the pattern (at current scale)
pattern(50);
}
(I've chosen the FX2D renderer because it's smoother in this case.)
Additionally, I advise formatting your code in the future. It makes it so much easier to read and it barely takes any effort (press Ctrl+T). In the long run you'll read code more than you'll write it, especially in large programs, and having code that's easy to read will save you plenty of time and potential headaches.

Why is my code for a double pendulum returning NaN?

When I print my acceleration and velocity, they both start (seemingly) normal. Shortly, they start getting very big, then return -Infinity, then return NaN. I have tried my best with the math/physics aspect, but my knowledge is limited, so be gentle. Any help would be appreciated.
float ang1, ang2, vel1, vel2, acc1, acc2, l1, l2, m1, m2, g;
void setup() {
background(255);
size(600, 600);
stroke(0);
strokeWeight(3);
g = 9.81;
m1 = 10;
m2 = 10;
l1 = 100;
l2 = 100;
vel1 = 0;
vel2 = 0;
acc1 = 0;
acc2 = 0;
ang1 = random(0, TWO_PI);
ang2 = random(0, TWO_PI);
}
void draw() {
pushMatrix();
background(255);
translate(width/2, height/2); // move origin
rotate(PI/2); // make 0 degrees face downward
ellipse(0, 0, 5, 5); // dot at origin
ellipse(l1*cos(ang1), l1*sin(ang1), 10, 10); // circle at m1
ellipse(l2*cos(ang2) + l1*cos(ang1), l2*sin(ang2) + l1*sin(ang1), 10, 10); // circle at m2
line(0, 0, l1*cos(ang1), l1*sin(ang1)); // arm 1
line(l1*cos(ang1), l1*sin(ang1), l2*cos(ang2) + l1*cos(ang1), l2*sin(ang2) + l1*sin(ang1)); // arm 2
float mu = 1 + (m1/m2);
acc1 = (g*(sin(ang2)*cos(ang1-ang2)-mu*sin(ang1))-(l2*vel2*vel2+l1*vel1*vel1*cos(ang1-ang2))*sin(ang1-ang2))/(l1*(mu-cos(ang1-ang2)*cos(ang1-ang2)));
acc2 = (mu*g*(sin(ang1)*cos(ang1-ang2)-sin(ang2))+(mu*l1*vel1*vel1+l2*vel2*vel2*cos(ang1-ang2))*sin(ang1-ang2))/(l2*(mu-cos(ang1-ang2)*cos(ang1-ang2)));
vel1 += acc1;
vel2 += acc2;
ang1 += vel1;
ang2 += vel2;
println(acc1, acc2, vel1, vel2);
popMatrix();
}
You haven't done anything wrong in your code, but the application of this mathematical technique is tricky.
This is a general problem with using numerical "solutions" to differential equations. Similar things happen if you try to simulate a bouncing ball:
//physics variables:
float g = 9.81;
float x = 200;
float y = 200;
float yvel = 0;
float radius = 10;
//graphing variables:
float[] yHist;
int iterator;
void setup() {
size(800, 400);
iterator = 0;
yHist = new float[width];
}
void draw() {
background(255);
y += yvel;
yvel += g;
if (y + radius > height) {
yvel = -yvel;
}
ellipse(x, y, radius*2, radius*2);
//graphing:
yHist[iterator] = height - y;
beginShape();
for (int i = 0; i < yHist.length; i++) {
vertex(i,
height - 0.1*yHist[i]
);
}
endShape();
iterator = (iterator + 1)%width;
}
If you run that code, you'll notice that the ball seems to bounce higher every single time. Obviously this does not happen in real life, nor should it happen even in ideal, lossless scenarios. So what happened here?
If you've ever used Euler's method for solving differential equations, you might recognize what's happening here. When we code a simulation of a differential equation like this, we are really applying Euler's method. In the case of the bouncing ball, the real curve is concave down (except at the points when it bounces). Euler's method always gives an overestimate when the real solution is concave down. That means that every frame, the computer guesses a little bit too high. These errors add up, and the ball bounces higher and higher.
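The accumulation of error is easy to see on a simpler system. The sketch below is an illustration only: it uses a unit mass on a unit spring (x'' = -x) rather than the ball or the pendulum, and integrates it with explicit Euler. The true energy stays at 0.5 forever; with this scheme it is multiplied by (1 + dt^2) on every step.

```python
def euler_energy(steps, dt, x=1.0, v=0.0):
    """Integrate x'' = -x with explicit Euler and return the final energy."""
    for _ in range(steps):
        # Both updates use the values from the start of the step.
        x, v = x + dt * v, v - dt * x
    return 0.5 * (x * x + v * v)


def predicted_energy(steps, dt, initial_energy=0.5):
    """Closed-form energy after the same number of explicit Euler steps."""
    return initial_energy * (1.0 + dt * dt) ** steps
```

Even a modest step size therefore pumps energy into the system exponentially, which is the same kind of runaway that eventually produces Infinity and NaN.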
Similarly, with your pendulum, it's getting a little bit more energy almost every single frame. This is a general problem with using numerical solutions. They are simply inaccurate. So what do we do?
In the case of the ball, we can avoid using a numerical solution altogether, and go to an analytical solution. I won't go through how I got the solution, but here is the different section:
float h0;
float t = 0;
float pd;
void setup() {
size(400, 400);
iterator = 0;
yHist = new float[width];
noFill();
h0 = height - y;
pd = 2*sqrt(h0/g);
}
void draw() {
background(255);
y = g*sq((t-pd/2)%pd - pd/2) + height - h0;
t += 0.5;
ellipse(x, y, radius*2, radius*2);
... etc.
This is all well and good for a bouncing ball, but a double pendulum is a much more complex system. There is no fully analytical solution to the double pendulum problem. So how do we minimize error in a numerical solution?
One strategy is to take smaller steps. The smaller the steps you take, the closer you are to the real solution. You can do this by reducing g (this might feel like cheating, but think for a minute about the units you're using. g=9.81 m/s^2. How does that translate to pixels and frames?). This will also make the pendulum move slower on the screen. If you want to increase accuracy without changing the viewing pace, you can take many small steps before rendering the frame. Consider replacing the acceleration, velocity, and angle updates in draw() (the lines from acc1 = ... through ang2 += vel2;) with
int substepCount = 1000;
for (int i = 0; i < substepCount; i++) {
acc1 = (g*(sin(ang2)*cos(ang1-ang2)-mu*sin(ang1))-(l2*vel2*vel2+l1*vel1*vel1*cos(ang1-ang2))*sin(ang1-ang2))/(l1*(mu-cos(ang1-ang2)*cos(ang1-ang2)));
acc2 = (mu*g*(sin(ang1)*cos(ang1-ang2)-sin(ang2))+(mu*l1*vel1*vel1+l2*vel2*vel2*cos(ang1-ang2))*sin(ang1-ang2))/(l2*(mu-cos(ang1-ang2)*cos(ang1-ang2)));
vel1 += acc1/substepCount;
vel2 += acc2/substepCount;
ang1 += vel1/substepCount;
ang2 += vel2/substepCount;
}
This changes your one big step to 1000 smaller steps, making it much more accurate. I tested that part out and it continued for over 20000 frames multiple times with no NaN errors. It might devolve into NaN at some point, but this allows it to last much longer.
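The effect of substepping can be measured on a toy system. This sketch again uses a unit harmonic oscillator as a stand-in (an assumption made to keep it short, not the pendulum equations) and reports how far the energy has drifted from its true value after the same amount of simulated time.

```python
def simulate(frames, dt, substeps):
    """Advance x'' = -x by `frames` frames of length dt, using explicit Euler substeps."""
    x, v = 1.0, 0.0
    h = dt / substeps
    for _ in range(frames):
        for _ in range(substeps):
            x, v = x + h * v, v - h * x
    return 0.5 * (x * x + v * v)


def relative_energy_drift(frames, dt, substeps):
    """Fractional energy error; 0.0 would mean perfect conservation."""
    return simulate(frames, dt, substeps) / 0.5 - 1.0
```

Comparing relative_energy_drift(100, 0.1, 1), relative_energy_drift(100, 0.1, 10) and relative_energy_drift(100, 0.1, 1000) shows the drift shrinking roughly in proportion to the number of substeps: the error is delayed, not eliminated.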
EDIT:
I also highly recommend using % TWO_PI when incrementing the angles:
ang1 = (ang1 + vel1/substepCount) % TWO_PI;
ang2 = (ang2 + vel2/substepCount) % TWO_PI;
because it makes the angle measurements MUCH more accurate in the later times.
When you don't do this, if vel1 is positive for a long time, then ang1 gets bigger and bigger. Once ang1 is greater than 1, the computer needs a bit to indicate the ones place, at the expense of an extra digit on the end. Since numbers are stored using binary, this happens again when ang1 > 2, and again when ang1 > 4, and so on.
If you keep the angle between -TWO_PI and TWO_PI (which is what % does in this case), its magnitude stays small, so only a few bits are spent on the integer part and the remaining bits can be used to measure the angle to the highest possible precision. This is actually important: if vel1/substepCount < 1/32768, and ang1 doesn't have enough bits to measure out to the 1/32768 place, then ang1 will not register the change.
To see the effects of this difference, give ang1 and ang2 really high initial values:
g = 0.0981;
ang1 = 101.1*PI;
ang2 = 101.1*PI;
If you don't use % TWO_PI, it approximates low velocities to zero, resulting in a bunch of stopping and starting.
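The lost increments can be reproduced with 32-bit floats, which is what Processing's float type is. In this Python sketch, to_float32 and add_float32 are helper names invented for the illustration; they round through the struct module because Python's own floats are 64-bit.

```python
import math
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def add_float32(a, b):
    """Add two numbers as a 32-bit float computation would."""
    return to_float32(to_float32(a) + to_float32(b))


big_angle = to_float32(101.1 * math.pi)   # about 317.6, spacing between floats is 1/32768
small_angle = to_float32(1.1 * math.pi)   # the same direction after 50 full turns are removed
tiny_step = 1e-5                          # smaller than half of 1/32768

big_unchanged = add_float32(big_angle, tiny_step) == big_angle
small_changed = add_float32(small_angle, tiny_step) > small_angle
```

With the large angle the tiny step is rounded away entirely, while the wrapped angle still registers it.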
END EDIT
If you need it to go for a ridiculously long time, so long that it isn't feasible to increase substepCount sufficiently, there is another thing you can do. This all comes about because vel increases to an extreme degree. You can constrain vel1 and vel2 so that they don't get too big.
In this case, I would recommend limiting the velocities based on conservation of energy. There is a maximum amount of mechanical energy allowed in the system based on the initial conditions. You cannot have more mechanical energy than the initial potential energy. Potential energy can be calculated based on the angles:
U(ang1, ang2) = -g*((m1+m2)*l1*cos(ang1) + m2*l2*cos(ang2))
Therefore we can determine exactly how much kinetic energy is in the system at any moment: The initial values of ang1 and ang2 give us the total mechanical energy. The current values of ang1 and ang2 give us the current potential energy. Then we can simply take the difference in order to find the current kinetic energy.
The way that pendulum motion is typically described does not lend itself to computing kinetic energy. It is possible, but I'm not going to do it here. My recommendation for constraining the velocities of the two pendulums is as follows:
1. Calculate the kinetic energy of the two arms separately.
2. Take the ratio between them.
3. Calculate the total kinetic energy currently in the two arms.
4. Distribute the kinetic energy in the same proportions as you calculated in step 2. For example, if you calculate that there is twice as much kinetic energy in the further mass as there is in the closer mass, put 1/3 of the kinetic energy in the closer mass and 2/3 in the further one.
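One possible way to implement this is sketched below in Python. It is a simplification, not the exact procedure from the steps above: it uses the standard kinetic energy expression for a double pendulum with point masses and angles measured from the downward vertical (my assumption about the model), and it scales both angular velocities by the same factor. Because the kinetic energy is quadratic in the velocities, a common factor keeps the proportions between the two arms unchanged. All function names are hypothetical.

```python
import math


def potential_energy(a1, a2, m1, m2, l1, l2, g):
    """U(ang1, ang2) as given above."""
    return -g * ((m1 + m2) * l1 * math.cos(a1) + m2 * l2 * math.cos(a2))


def kinetic_energy(a1, a2, w1, w2, m1, m2, l1, l2):
    """Kinetic energy of both point masses for angular velocities w1, w2."""
    return (0.5 * (m1 + m2) * l1 * l1 * w1 * w1
            + 0.5 * m2 * l2 * l2 * w2 * w2
            + m2 * l1 * l2 * w1 * w2 * math.cos(a1 - a2))


def limit_velocities(a1, a2, w1, w2, total_energy, m1, m2, l1, l2, g):
    """Scale w1 and w2 down if the system holds more energy than it started with."""
    allowed = total_energy - potential_energy(a1, a2, m1, m2, l1, l2, g)
    current = kinetic_energy(a1, a2, w1, w2, m1, m2, l1, l2)
    if current <= allowed or current <= 0.0:
        return w1, w2
    scale = math.sqrt(max(allowed, 0.0) / current)
    return w1 * scale, w2 * scale
```

Here total_energy would be computed once in setup() as the potential energy of the initial angles, since the pendulum starts at rest, and limit_velocities would be called after each velocity update.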
I hope this helps, let me know if you have any questions.

Position of object with n known points and distances

I'm doing some work with triangulating the position of a receiver based on its distance to known points in space. I basically get a set of distances D[N] from the nearby broadcast points, and have a lookup table to give me the X, Y, and Z values for each. I need to find the X, Y, and Z points of my receiver.
I've seen that you can use trilateration to solve this problem in 2D cases, but it doesn't work for 3D. I would like to be able to use N points in order to improve accuracy, but I suppose I could also use the closest 4 points.
My issue is I have no idea how to programmatically solve the system of equations algebraically so it can be done in my program. I've seen a few solutions using things like Matlab, but I don't have the same tools.
This seems to solve the problem, if someone knows how to translate Matlab into a C language (I've never used Matlab): Determine the position of a point in 3D space given the distance to N points with known coordinates
Here is my solution in C++ (it should be easy to convert to plain C). It does not use any advanced algebra, so it does not require any non-standard libraries. It works well when the number of points is fairly large (the error gets smaller as the number of points grows), so it is better to pass all the points that you have to the solve method. It minimizes the sum of squared differences using sub-gradient descent.
#include <algorithm> // max
#include <cmath> // sqrt
#include <vector>
using namespace std;
const int ITER = 2000;
const double ALPHA = 2.0;
const double RATIO = 0.99;
double sqr(double a) {
return a * a;
}
struct Point {
double x;
double y;
double z;
Point(double _x=0.0, double _y=0.0, double _z=0.0): x(_x), y(_y), z(_z) {}
double dist(const Point &other) const {
return sqrt(sqr(x - other.x) + sqr(y - other.y) + sqr(z - other.z));
}
Point operator + (const Point& other) const {
return Point(x + other.x, y + other.y, z + other.z);
}
Point operator - (const Point& other) const {
return Point(x - other.x, y - other.y, z - other.z);
}
Point operator * (const double mul) const {
return Point(x * mul, y * mul, z * mul);
}
};
Point solve(const vector<Point>& given, const vector<double>& dist) {
Point res;
double alpha = ALPHA;
for (int iter = 0; iter < ITER; iter++) {
Point delta;
for (int i = 0; i < given.size(); i++) {
double d = res.dist(given[i]);
Point diff = (given[i] - res) * (alpha * (d - dist[i]) / max(dist[i], d));
delta = delta + diff;
}
delta = delta * (1.0 / given.size());
alpha *= RATIO;
res = res + delta;
}
return res;
}
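For comparison, the system of sphere equations can also be solved directly. This is a different technique from the descent above, shown here as a hedged Python sketch: subtracting the first equation from the others removes the quadratic term and leaves a linear system 2(p_i - p_0)·x = |p_i|^2 - |p_0|^2 - d_i^2 + d_0^2, which is solved in the least-squares sense. It assumes at least four anchors that are not all in one plane; the function names are invented for the example.

```python
import math


def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("anchors are (nearly) coplanar; the system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        known = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - known) / m[r][r]
    return x


def locate(anchors, distances):
    """Least-squares receiver position from anchor points and measured distances."""
    p0, d0 = anchors[0], distances[0]
    rows, rhs = [], []
    for p, d in zip(anchors[1:], distances[1:]):
        rows.append([2.0 * (p[k] - p0[k]) for k in range(3)])
        rhs.append(sum(c * c for c in p) - sum(c * c for c in p0) - d * d + d0 * d0)
    # Normal equations: (A^T A) x = A^T b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(3)]
    return solve_3x3(ata, atb)
```

With noisy distances the linearised answer is usually only approximate, so it can serve as a starting point for an iterative method such as the one above instead of starting from the origin.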
Here is an answer using Matlab (which is not what you are asking for); it uses the Nelder-Mead simplex optimization algorithm. Because you do not have access to Matlab, you could use R (freely available). The code below is easily translatable to R, and in R you can use the Nelder-Mead algorithm from the neldermead package to replace 'fminsearch'. For differences between R and Matlab (and Octave), see: http://cran.r-project.org/doc/contrib/R-and-octave.txt
function Ate=GetCoordinate(Atr,Dte)
% Atr = coordinates for known points
% Dte = distances for receiver to each row in Atr
% Ate = coordinate for receiver
[~,ii]=sort(Dte,'ascend');
Ate=mean(Atr(ii(1:4),:)); % (reasonable) start point
[Ate,~]=fminsearch(@(Ate) ED(Ate,Atr,Dte),Ate); % Uses Nelder-Mead simplex algorithm to find optimum
end
function d=ED(Ate,Atr,Dte) % calculates the sum of the squared difference between the measured distances and distances based on coordinate Ate (for receiver)
for k=1:size(Dte,1)
d(k,1)=sqrt((Atr(k,:)-Ate)*(Atr(k,:)-Ate)'); % Euclidean distance
end
d=sqrt(sum((Dte-d).^2));
end

Linear interpolation along a "Bresenham line"

I'm using linear interpolation for animating an object between two 2d coordinates on the screen. This is pretty close to what I want, but because of rounding, I get a jagged motion. In ASCII art:
ooo
  ooo
    ooo
      oo
Notice how it walks in a Manhattan grid, instead of taking 45 degree turns. What I'd like is linear interpolation along the line which Bresenham's algorithm would have created:
oo
  oo
    oo
      oo
For each x there is only one corresponding y. (And swap x/y for a line that is steep)
So why don't I just use Bresenham's algorithm? I certainly could, but that algorithm is iterative, and I'd like to know just one coordinate along the line.
I am going to try solving this by linearly interpolating the x coordinate, round it to the pixel grid, and then finding the corresponding y. (Again, swap x/y for steep lines). No matter how that solution pans out, though, I'd be interested in other suggestion and maybe previous experience.
Bresenham's algorithm for lines was introduced to draw a complete line a bit faster than usual approaches. It has two major advantages:
It works on integer variables
It works iteratively, which is fast, when drawing the complete line
The first advantage is not a big deal if you calculate only some coordinates. The second advantage turns into a disadvantage when calculating only some coordinates. So after all, there is no need to use Bresenham's algorithm.
Instead, you can use a different algorithm that results in the same line, for example the DDA (digital differential analyzer). This is basically the same approach you mentioned.
First step: Calculate the slope.
m = (y_end - y_start) / (x_end - x_start)
Second step: Calculate the iteration step, which is simply:
i = x - x_start
Third step: Calculate the corresponding y-value:
y = y_start + i * m
= y_start + (x - x_start) * (y_end - y_start) / (x_end - x_start)
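Put together, the three steps give a direct, non-iterative lookup. The following Python sketch assumes integer endpoints and rounds halves upward; the handling of steep lines (swapping x and y) and of coincident endpoints is my addition, and point_on_line is a made-up name.

```python
import math


def round_half_up(value):
    """Round to the nearest integer, with .5 always going up."""
    return math.floor(value + 0.5)


def point_on_line(start, end, t):
    """Pixel at fraction t (0..1) along the rasterised line from start to end."""
    (x0, y0), (x1, y1) = start, end
    dx, dy = x1 - x0, y1 - y0
    if dx == 0 and dy == 0:
        return (x0, y0)
    if abs(dy) > abs(dx):
        # Steep line: swap the roles of x and y, then swap the result back.
        y, x = point_on_line((y0, x0), (y1, x1), t)
        return (x, y)
    # Interpolate along the major axis, snap to the pixel grid,
    # then evaluate y = y_start + (x - x_start) * dy / dx for that pixel.
    x = round_half_up(x0 + t * dx)
    y = round_half_up(y0 + (x - x0) * dy / dx)
    return (x, y)
```

Because y is computed from the already rounded x, each x maps to exactly one y, so the animated object follows the same pixels a line-drawing routine would produce instead of a staircase.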
Here's the solution I ended up with:
public static Vector2 BresenhamLerp(Vector2 a, Vector2 b, float percent)
{
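if (a.x == b.x && a.y == b.y)
{
// Degenerate line: both endpoints coincide.
return a;
}
if (Math.Abs(a.x - b.x) < Math.Abs(a.y - b.y))
{
// Steep (or vertical) line: recurse with x/y swapped and swap the result on return.
Vector2 swapped = BresenhamLerp(new Vector2(a.y, a.x), new Vector2(b.y, b.x), percent);
return new Vector2(swapped.y, swapped.x);
}
Vector2 result;
// Math.Round returns a double, so cast back to float.
result.x = (float)Math.Round((1-percent) * a.x + percent * b.x);
float adjustedPercent = (result.x - a.x + 0.5f) / (b.x - a.x);
result.y = (float)Math.Round((1-adjustedPercent) * a.y + adjustedPercent * b.y);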
return result;
}
This is what I just figured out would work. Probably not the most beautiful interpolation, but it takes just 1-2 float additions per iteration along the line, with a one-time precalculation. It works by calculating the number of steps on a Manhattan grid.
Ah, and it does not yet catch the case when the line is vertical (dx = 0)
This is the naive bresenham, but the iterations could in theory only use integers as well. If you want to get rid of the float color value, things are going to get harder because the line might be longer than the color difference, so delta-color < 1.
void Brepolate( uint8_t* pColorBuffer, uint8_t cs, float xs, float ys, float zs, uint8_t ce, float xe, float ye, float ze )
{
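// Number of steps on the Manhattan grid (one per x or y increment).
float nColSteps = (xe - xs) + (ye - ys);
// Per-step color change, going from the start color cs towards the end color ce.
float fColInc = ((float)ce - (float)cs) / nColSteps;
float fCol = cs;
float dx = xe - xs;
float dy = ye - ys;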
if (dx > 0.5)
{
float de = fabs( dy / dx );
float re = de - 0.5f;
uint32_t iY = ys;
for ( uint32_t iX = xs;
iX <= xe;
iX++ )
{
uint32_t off = surf.Offset( iX, iY );
pColorBuffer[off] = fCol;
re += de;
if (re >= 0.5f)
{
iY++;
re -= 1.0f;
fCol += fColInc;
}
fCol += fColInc;
}
}
}

Fast algorithm for image distortion

I am working on a tool which distorts images, the purpose of the distortion is to project images to a sphere screen. The desired output is as the following image.
The code I use is as follows: for every Point(x, y) in the destination area, I calculate the corresponding pixel (sourceX, sourceY) in the original image to retrieve from.
But this approach is awkwardly slow. In my test, processing sunset.jpg (800*600) takes more than 1500ms; if I remove the mathematical/trigonometric calculations, calling cvGet2D and cvSet2D alone takes more than 1200ms.
Is there a better way to do this? I am using Emgu CV (a .NET wrapper library for OpenCV), but examples in other languages are also OK.
private static void DistortSingleImage()
{
System.Diagnostics.Stopwatch stopWatch = System.Diagnostics.Stopwatch.StartNew();
using (Image<Bgr, Byte> origImage = new Image<Bgr, Byte>("sunset.jpg"))
{
int frameH = origImage.Height;
using (Image<Bgr, Byte> distortImage = new Image<Bgr, Byte>(2 * frameH, 2 * frameH))
{
MCvScalar pixel;
for (int x = 0; x < 2 * frameH; x++)
{
for (int y = 0; y < 2 * frameH; y++)
{
if (x == frameH && y == frameH) continue;
int x1 = x - frameH;
int y1 = y - frameH;
if (x1 * x1 + y1 * y1 < frameH * frameH)
{
double radius = Math.Sqrt(x1 * x1 + y1 * y1);
double theta = Math.Acos(x1 / radius);
int sourceX = (int)(theta * (origImage.Width - 1) / Math.PI);
int sourceY = (int)radius;
pixel = CvInvoke.cvGet2D(origImage.Ptr, sourceY, sourceX);
CvInvoke.cvSet2D(distortImage, y, x, pixel);
}
}
}
distortImage.Save("Distort.jpg");
}
Console.WriteLine(stopWatch.ElapsedMilliseconds);
}
}
From my personal experience (I was doing some stereoscopic vision work), the best way to talk to OpenCV is through your own wrapper. You could put your method in C++ and call it from C#; that would give you a single call into native, faster code. Because Emgu keeps OpenCV data under the hood, it's also possible to create an image with Emgu, process it natively, and use the processed image in C# again.
The get/set methods look like GDI's GetPixel / SetPixel ones, and according to the documentation they are the "slow but safe way".
To stay with Emgu only, the documentation says that if you want to iterate over pixels, you should access the .Data property:
The safe (slow) way
Suppose you are working on an Image. You can obtain the pixel on the y-th row and x-th column by calling
Bgr color = img[y, x];
Setting the pixel on the y-th row and x-th column is also simple
img[y,x] = color;
The fast way
The Image pixels values are stored in the Data property, a 3D array. Use this property if you need to iterate through the pixel values of the image.
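The same inverse mapping can be written against a plain array instead of per-pixel get/set calls. The sketch below is Python, not Emgu CV: a nested list indexed as [row][column] stands in for the Data array, and distort is a hypothetical name. It only illustrates the loop structure; in C# the reads and writes inside the loop would go to the source and destination Data arrays.

```python
import math


def distort(source):
    """Map a rectangular image (list of rows) onto a disc of radius len(source)."""
    height, width = len(source), len(source[0])
    size = 2 * height
    result = [[None] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            x1, y1 = x - height, y - height
            r2 = x1 * x1 + y1 * y1
            if r2 == 0 or r2 >= height * height:
                continue  # the centre pixel and everything outside the disc stay empty
            radius = math.sqrt(r2)
            theta = math.acos(x1 / radius)
            src_x = int(theta * (width - 1) / math.pi)
            src_y = int(radius)
            # Direct array indexing replaces the cvGet2D / cvSet2D calls.
            result[y][x] = source[src_y][src_x]
    return result
```

Since the mapping from destination to source coordinates depends only on the image size, it could also be computed once and reused for every frame of the same size.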

Resources