Any recommendation on how to implement efficient integral functions, like SumX and SumY, in GLSL shaders?
SumX(u) = Integration with respect to x = I(u0,y) + I(u1,y) +... + I(uN,y); u=normalized x coordinate
SumY(v) = Integration with respect to y = I(x,v0) + I(x,v1) +... + I(x,vN); v=normalized y coordinate
For instance the 5th pixel of the first line would be the sum of all five pixels on the first line. And the last pixel would be the sum of all previous pixels including the last pixel itself.
What you are asking for is called prefix sum or summed area table (SAT) for the 2D case (just so you find online resources more easily).
Summed area tables can be efficiently implemented on the GPU by decomposing into several parrallel prefix sum passes [1], [2].
The prefix sum can be accelerated by using local memory to store intermediate partial sums (see example in OpenCL or example in CUDA, the same can in principle be done in an OpenGL fragment shader as well with image load-store, or in a compute shader: OpenGL Super Bible example, similar example to be found in OpenGL Insights around page 280).
Note that you may quickly run into precision issues as the sum may get quite large for the rightmost (downmost) pixels. Integer or fp16 render targets will most likely result in failure due to overflow or lacking precision, fp32 will work most of the time.
Related
I am trying to compute the integral image (aka summed area table) of a texture I have in the GPU memory (a camera capture), the goal being to compute the adaptive threshold of said image. I'm using OpenGL ES 2.0, and still learning :).
I did a test with a simple gaussian blur shader (vertical/horizontal pass), which is working fine, but I need a way bigger variable average area for it to give satisfactory results.
I did implement a version of that algorithm on CPU before, but I'm a bit confused on how to implement that on a GPU.
I tried to do a (completely incorrect) test with just something like this for every fragment :
#version 100
#extension GL_OES_EGL_image_external : require
precision highp float;
uniform sampler2D u_Texture; // The input texture.
varying lowp vec2 v_TexCoordinate; // Interpolated texture coordinate per fragment.
uniform vec2 u_PixelDelta; // Pixel delta
void main()
{
// get neighboring pixels values
float center = texture2D(u_Texture, v_TexCoordinate).r;
float a = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, 0.0)).r;
float b = texture2D(u_Texture, v_TexCoordinate + vec2(0.0, u_PixelDelta.y * 1.0)).r;
float c = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, u_PixelDelta.y * 1.0)).r;
// compute value
float pixValue = center + a + b - c;
// Result stores value (R) and original gray value (G)
gl_FragColor = vec4(pixValue, center, center, 1.0);
}
And then another shader to get the area that I want and then get the average. This is obviously wrong as there's multiple execution units operating at the same time.
I know that the common way of computing a prefix sum on a GPU is to do it in two pass (vertical/horizontal, as discussed here on this thread or or here), but isn't there a problem here as there is a data dependency on each cell from the previous (top or left) one ?
I can't seem to understand the order in which the multiple execution units on a GPU will process the different fragments, and how a two-pass filter can solve that issue. As an example, if I have some values like this :
2 1 5
0 3 2
4 4 7
The two pass should give (first columns then rows):
2 1 5 2 3 8
2 4 7 -> 2 6 13
6 8 14 6 14 28
How can I be sure that, as an example, the value [0;2] will be computed as 6 (2 + 4) and not 4 (0 + 4, if the 0 hasn't been computed yet) ?
Also, as I understand that fragments are not pixels (If I'm not mistaken), would the values I store back in one of my texture in the first pass be the same in another pass if I use the exact same coordinates passed from the vertex shader, or will they be interpolated in some way ?
Tommy and Bartvbl address your questions about a summed-area table, but your core problem of an adaptive threshold may not need that.
As part of my open source GPUImage framework, I've done some experimentation with optimizing blurs over large radii using OpenGL ES. Generally, increasing blur radii leads to a significant increase in texture sampling and calculations per pixel, with an accompanying slowdown.
However, I found that for most blur operations you can apply a surprisingly effective optimization to cap the number of blur samples. If you downsample the image before blurring, blur at a smaller pixel radius (radius / downsampling factor), and then linearly upsample, you can arrive at a blurred image that is the equivalent of one blurred at a much larger pixel radius. In my tests, these downsampled, blurred, and then upsampled images look almost identical to the ones blurred based on the original image resolution. In fact, precision limits can lead to larger-radii blurs done at a native resolution breaking down in image quality past a certain size, where the downsampled ones maintain the proper image quality.
By adjusting the downsampling factor to keep the downsampled blur radius constant, you can achieve near constant-time blurring speeds in the face of increasing blur radii. For a adaptive threshold, the image quality should be good enough to use for your comparisons.
I use this approach in the Gaussian and box blurs within the latest version of the above-linked framework, so if you're running on Mac, iOS, or Linux, you can evaluate the results by trying out one of the sample applications. I have an adaptive threshold operation based on a box blur that uses this optimization, so you can see if the results there are what you want.
AS per the above, it's not going to be fantastic on a GPU. But assuming the cost of shunting data between the GPU and CPU is more troubling it may still be worth persevering.
The most obvious prima facie solution is to split horizontal/vertical as discussed. Use an additive blending mode, create a quad that draws the whole source image then e.g. for the horizontal step on a bitmap of width n issue a call that requests the quad be drawn n times, the 0th time at x = 0, the mth time at x = m. Then ping pong via an FBO, switching the target of buffer of the horizontal draw into the source texture for the vertical.
Memory accesses are probably O(n^2) (i.e. you'll probably cache quite well, but that's hardly a complete relief) so it's a fairly poor solution. You could improve it by divide and conquer by doing the same thing in bands — e.g. for the vertical step, independently sum individual rows of 8, after which the error in every row below the final is the failure to include whatever the sums are on that row. So perform a second pass to propagate those.
However an issue with accumulating in the frame buffer is clamping to avoid overflow — if you're expecting a value greater than 255 anywhere in the integral image then you're out of luck because the additive blending will clamp and GL_RG32I et al don't reach ES prior to 3.0.
The best solution I can think of to that, without using any vendor-specific extensions, is to split up the bits of your source image and combine channels after the fact. Supposing your source image were 4 bit and your image less than 256 pixels in both directions, you'd put one bit each in the R, G, B and A channels, perform the normal additive step, then run a quick recombine shader as value = A + (B*2) + (G*4) + (R*8). If your texture is larger or smaller in size or bit depth then scale up or down accordingly.
(platform specific observation: if you're on iOS then you've hopefully already got a CVOpenGLESTextureCache in the loop, which means you have CPU and GPU access to the same texture store, so you might well prefer to kick this step off to GCD. iOS is amongst the platforms supporting EXT_shader_framebuffer_fetch; if you have access to that then you can write any old blend function you like and at least ditch the combination step. Also you're guaranteed that preceding geometry has completed before you draw so if each strip writes its totals where it should and also to the line below then you can perform the ideal two-pixel-strips solution with no intermediate buffers or state changes)
What you attempt to do cannot be done in a fragment shader. GPU's are by nature very different to CPU's by executing their instructions in parallel, in massive numbers at the same time. Because of this, OpenGL does not make any guarantees about execution order, because the hardware physically doesn't allow it to.
So there is not really any defined order other than "whatever the GPU thread block scheduler decides".
Fragments are pixels, sorta-kinda. They are pixels that potentially end up on screen. If another triangle ends up in front of another, the previous calculated colour value is discarded. This happens regardless of whatever colour was stored at that pixel in the colour buffer previously.
As for creating the summed area table on the GPU, I think you may first want to look at GLSL "Compute Shaders", which are specifically made for this sort of thing.
I think you may be able to get this to work by creating a single thread for each row of pixels in the table, then have every thread "lag behind" by 1 pixel compared to the previous row.
In pseudocode:
int row_id = thread_id()
for column_index in (image.cols + image.rows):
int my_current_column_id = column_index - row_id
if my_current_column_id >= 0 and my_current_column_id < image.width:
// calculate sums
The catch of this method is that all threads should be guaranteed to execute their instructions simultaneously without getting ahead of one another. This is guaranteed in CUDA, but I'm not sure whether it is in OpenGL compute shaders. It may be a starting point for you, though.
It may look surprising for the beginner but the prefix sum or SAT calculation is suitable for parallelization. As the Hensley algorithm is the most intuitive to understand (also implemented in OpenGL), more work-efficient parallel methods are available, see CUDA scan. The paper from Sengupta discuss parallel method which seems state-of-the-art efficient method with reduce and down swap phases. These are valuable materials but they do not enter OpenGL shader implementations in detail. The closest document is the presentation you have found (it refers to Hensley publication), since it has some shader snippets. This is the job which is doable entirely in fragment shader with FBO Ping-Pong. Note that the FBO and its texture need to have internal format set to high precision - GL_RGB32F would be best but I am not sure if it is supported in OpenGL ES 2.0.
Currently I'm using Math.cos and Math.sin to move objects in a circle in my game, however I suspect it's slow (didn't make proper tests yet though) after reading a bit about it.
Are there any ways to calculate this in a faster way?. Been reading that one alternative could be to have a sort of hash table with stored pre-calculated results, like old people used it in the old times before the computer age.
Any input is appreciated.
Expanding on my comment, if you don't have any angular acceleration (the angular velocity stays constant -- this is a requirement for the object to remain traveling in a circle with constant radius without changing the center-pointing force, e.g. via tension in a string), then you can use the following strategy:
1) Compute B = angular_velocity * time_step_size. This is how much angle change the object needs to go through in a single time step.
2) Compute sinb = sin(B) and cosb = cos(B).
3)
Note that we want to change the angle from A to A+B (the object is going counterclockwise). In this derivation, the center of the circle we're orbiting is given by the origin.
Since the radius of the circle is constant, we know r*sin(A+B) = y_new = r*sin(A)cos(B) + r*cos(A)sin(B) = y_old * cos(B) + x_old*sin(B) and r*cos(A+B) = x_new = r*cos(A)*cos(B) - r*sin(A)sin(B) = x_old*cos(B) - y_old*sin(B).
We've removed the cosine and sine of anything we don't already know, so the Cartesian coordinates can be written as
x_new = x_old*cosb - y_old*sinb
y_new = x_old*sinb + y_old*cosb
No more cos or sin calls except in an initialization step which is called once. Obviously, this won't save you anything if B keeps changing for whatever reason (either angular velocity or time step size changes).
You'll notice this is the same as multiplying the position vector by a fixed rotation matrix. You can translate by the circle center and translate back if you don't want to only consider circles with a center at the origin.
First Edit
As #user5428643 mentions, this method is numerically unstable over time due to drift in the radius. You can probably correct this by periodically renormalizing x and y (x_new = x_old * r_const / sqrt(x_old^2 + y_old^2) and similarly for y every few thousand steps -- if you implement this, save the factor r_const / sqrt(x_old^2 + y_old^2) since it is the same for both x and y). I'll think about it some more and edit this answer if I come up with a better fix.
Second Edit
Some more comments on the numerical drift over time:
I did a couple of tests in C++ and python. In C++ using single precision floats, there is sizable drift even after 1 million time steps when B = 0.1. I used a circle with radius 1. In double precision, I didn't notice any drift visually after 100 million steps, but checking the radius shows that it is contaminated in the lower few digits. Doing the renormalization on every step (which is unnecessary if you're just doing visualization) results in an approximately 4 times slower running time versus the drifty version. However, even this version is about 2-3 times faster than using sin and cos on every iteration. I used full optimization (-O3) in g++. In python (using the math package) I only got a speed up of 2 between the drifty and normalized versions, however the sin and cos version actually slots in between these two -- it's almost exactly halfway between these two in terms of run time. Renormalizing every once in a few thousand steps would still make this faster, but it's not nearly as big a difference as my C++ version would indicate.
I didn't do too much scientific testing to get the timings, just a few tests with 1 million to 1 billion steps in increments of 10.
Sorry, not enough rep to comment.
The answers by #neocpp and #oliveryas01 would both be perfectly correct without roundoff error.
The answer by #oliveryas01, just using sine and cosine directly, and precalculating and storing many values if necessary, will work fine.
However, #neocpp's answer, repeatedly rotating by small angles using a rotation matrix, is numerically unstable; over time, the roundoff error in the radius will tend to grow exponentially, so if you run your programme for a long time the objects will slowly move off the circle, spiralling either inwards or outwards.
You can see this mathematically with a little numerical analysis: at each stage, the squared radius is approximately multiplied by a number which is approximately constant and approximately equal to 1, but almost certainly not exactly equal to 1 due to inexactness of floating point representations.
If course, if you're using double precision numbers and only trying to achieve a simple visual effect, this error may not be large enough to matter to you.
I would stick with sine and cosine if I were you. They're the most efficient way to do what you're trying to do. If you really want maximum performance then you should generate an array of x and y values from the sine and cosine values, then plug that array's values into the circle's position. This way, you aren't running sine and cosine repeatedly, instead only for one cycle.
Another possibility completely avoiding the trig functions would be use a polar-coordinate model, where you set the distance and angle. For example, you could set the x coordinate to be the distance, and the rotation to be the angle, as in...
var gameBoardPin:Sprite = new Sprite();
var gameEntity:Sprite = new YourGameEntityHere();
gameBoardPin.addChild( gameEntity );
...and in your loop...
// move gameEntity relative to the center of gameBoardPin
gameEntity.x = circleRadius;
// rotate gameBoardPin from its center causes gameEntity to rotate at the circleRadius
gameBoardPin.rotation = desiredAngleForMovingObject
gameBoardPin's x,y coordinates would be set to the center of rotation for gameEntity. So, if you wanted the gameEntity to rotate with a 100 pixel tether around the center of the stage, you might...
gameBoardPin.x = stage.stageWidth / 2;
gameBoardPin.y = stage.stageHeight / 2;
gameEntity.x = 100;
...and then in the loop you might...
desiredAngleForMovingObject += 2;
gameBoardPin.rotation = desiredAngleForMovingObject
With this method you're using degrees instead of radians.
I try to tilt compensate a magnetometer (BMX055) reading and tried various approaches I have found online, not a single one works.
I atually tried almost any result I found on Google.
I run this on an AVR, it would be extra awesome to find something that works without complex functions (trigonometry etc) for angles up to 50 degree.
I have a fused gravity vector (int16 signed in a float) from gyro+acc (1g gravity=16k).
attitude.vect_mag.x/y/z is a float but contains a 16bit integer ranging from around -250 to +250 per axis.
Currently I try this code:
float rollRadians = attitude.roll * DEG_TO_RAD / 10;
float pitchRadians = attitude.pitch * DEG_TO_RAD / 10;
float cosRoll = cos(rollRadians);
float sinRoll = sin(rollRadians);
float cosPitch = cos(pitchRadians);
float sinPitch = sin(pitchRadians);
float Xh = attitude.vect_mag.x * cosPitch + attitude.vect_mag.z * sinPitch;
float Yh = attitude.vect_mag.x * sinRoll * sinPitch + attitude.vect_mag.y * cosRoll - attitude.vect_mag.z *sinRoll * cosPitch;
float heading = atan2(Yh, Xh);
attitude.yaw = heading*RAD_TO_DEG;
The result is meaningless, but the values without tilt compensation are correct.
The uncompensated formula:
atan2(attitude.vect_mag.y,attitude.vect_mag.x);
works fine (when not tilted)
I am sort of clueless what is going wrong, the normal atan2 returns a good result (when balanced) but using the wide spread formulas for tilt compensation completely fails.
Do I have to keep the mag vector values within a specific range for the trigonometry to work ?
Any way to do the compensation without trig functions ?
I'd be glad for some help.
Update:
I found that the BMX055 magnetometer has X and Y inverted as well as Y axis is *-1
The sin/cos functions now seem to lead to a better result.
I am trying to implement the suggested vector algorithms, struggling so far :)
Let us see.
(First, forgive me a bit of style nagging. The keyword volatile means that the variable may change even if we do not change it ourselves in our code. This may happen with a memory position that is written by another process (interrupt request in AVR context). For the compiler volatile means that the variable has to be always loaded and stored into memory when used. See:
http://en.wikipedia.org/wiki/Volatile_variable
So, most likely you do not want to have any attributes to your floats.)
Your input:
three 12-bit (11 bits + sign) integers representing accelerometer data
three approximately 9-bit (8 bits + sign) integers representing the magnetic field
Good news (well...) is that your resolution is not that big, so you can use integer arithmetics, which is much faster. Bad news is that there is no simple magical one-liner which would solve your problem.
First of all, what would you like to have as the compass bearing when the device is tilted? Should the device act as if it was not tilted, or should it actually show the correct projection of the magnetic field lines on the screen? The latter is how an ordinary compass acts (if the needle moves at all when tilted). In that case you should not compensate for anything, and the device can show the fancy vertical tilt of the magnetic lines when rolled sideways.
In any case, try to avoid trigonometry, it takes a lot of code space and time. Vector arithmetics is much simpler, and most of the time you can make do with multiplys and adds.
Let us try to define your problem in vector terms. Actually you have two space vectors to start with, m pointing to the direction of the magnetic field, g to the direction of gravity. If I have understood your intention correctly, you need to have vector d which points along some fixed direction in the device. (If I think of a mobile phone, d would be a vector parallel to the screen left or right edges.)
With vector mathematics this looks rather simple:
g is a normal to a horizontal (truly horizontal) plane
the projection of m on this plane defines the direction a horizontal compass would show
the projection of d on the plane defines the "north" on the compass face
the angle between m and d gives the compass bearing
Now that we are not interested in the magnitude of the magnetic field, we can scale everything as we want. This reduces the need to use unity vectors which are expensive to calculate.
So, the maths will be something along these lines:
# projection of m on g (. represents dot product)
mp := m - g (m.g) / (g.g)
# projection of d on g
dp := d - g (d.g) / (g.g)
# angle between mp and dp
cos2 := (mp.dp)^2 / (mp.mp * dp.dp)
sgn1 := sign(mp.dp)
# create a vector 90 rotated from d on the plane defined by g (x is cross product)
drot := dp x g
sin2 := (mp.drot)^2 / (mp.mp * drot.drot)
sgn2 := sign(mp.drot)
After this you will have a sin^2 and cos^2 of the compass directions. You need to create a resolving function for one quadrant and then determine the correct quadrant by using the signs. The resolving function may sound difficult, but actually you just need to create a table lookup function for sin2/cos2 or cos2/sin2 (whichever is smaller). It is relatively fast, and only a few points are required in the lookup (with bilinear approximation even fewer).
So, as you can see, there are no trig functions around, and even no square roots around. Vector dots and crosses are just multiplys. The only slightly challenging trick is to scale the fixed point arithmetics to the correct scale in each calculation.
You might notice that there is a lot of room for optimization, as the same values are used several times. The first step is to get the algorithm run on a PC with floating point with the correct results. The optimizations come later.
(Sorry, I am not going to write the actual code here, but if there is something that needs clarifying, I'll be glad to help.)
I am using Laplacian of Gaussian for edge detection using a combination of what is described in http://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm and http://wwwmath.tau.ac.il/~turkel/notes/Maini.pdf
Simply put, I'm using this equation :
for(int i = -(kernelSize/2); i<=(kernelSize/2); i++)
{
for(int j = -(kernelSize/2); j<=(kernelSize/2); j++)
{
double L_xy = -1/(Math.PI * Math.pow(sigma,4))*(1 - ((Math.pow(i,2) + Math.pow(j,2))/(2*Math.pow(sigma,2))))*Math.exp(-((Math.pow(i,2) + Math.pow(j,2))/(2*Math.pow(sigma,2))));
L_xy*=426.3;
}
}
and using up the L_xy variable to build the LoG kernel.
The problem is, when the image size is larger, application of the same kernel is making the filter more sensitive to noise. The edge sharpness is also not the same.
Let me put an example here...
Suppose we've got this image:
Using a value of sigma = 0.9 and a kernel size of 5 x 5 matrix on a 480 × 264 pixel version of this image, we get the following output:
However, if we use the same values on a 1920 × 1080 pixels version of this image (same sigma value and kernel size), we get something like this:
[Both the images are scaled down version of an even larger image. The scaling down was done using a photo editor, which means the data contained in the images are not exactly similar. But, at least, they should be very near.]
Given that the larger image is roughly 4 times the smaller one... I also tried scaling the sigma by factor of 4 (sigma*=4) and the output was... you guessed it right, a black canvas.
Could you please help me realize how to implement a LoG edge detector that finds the same features from an input signal, even if the incoming signal is scaled up or down (scaling factor will be given).
Looking at your images, I suppose you are working in 24-bit RGB. When you increase your sigma, the response of your filter weakens accordingly, thus what you get in the larger image with a larger kernel are values close to zero, which are either truncated or so close to zero that your display cannot distinguish.
To make differentials across different scales comparable, you should use the scale-space differential operator (Lindeberg et al.):
Essentially, differential operators are applied to the Gaussian kernel function (G_{\sigma}) and the result (or alternatively the convolution kernel; it is just a scalar multiplier anyways) is scaled by \sigma^{\gamma}. Here L is the input image and LoG is Laplacian of Gaussian -image.
When the order of differential is 2, \gammais typically set to 2.
Then you should get quite similar magnitude in both images.
Sources:
[1] Lindeberg: "Scale-space theory in computer vision" 1993
[2] Frangi et al. "Multiscale vessel enhancement filtering" 1998
Like many 3d graphical programs, I have a bunch of objects that have their own model coordinates (from -1 to 1 in x, y, and z axis). Then, I have a matrix that takes it from model coordinates to world coordinates (using the location, rotation, and scale of the object being drawn). Finally, I have a second matrix to turn those world coordinates into canonical coordinates that OopenGL ES 2.0 will use to draw to the screen.
So, because one object can contain many vertices, all of which use the same transform into both world space, and canonical coordinates, it's faster to calculate the product of those two matrices once, and put each vertex through the resulting matrix, rather than putting each vertex through both matrices.
But, as far as I can tell, there doesn't seem to be a way in OpenGL ES 2.0 shaders to have it calculate the matrix once, and keep using it until the one of the two matrices used until glUniformMatrix4fv() (or another function to set a uniform) is called. So it seems like the only way to calculate the matrix once would be to do it on the CPU, and then result to the GPU using a uniform. Otherwise, when something like:
gl_Position = uProjection * uMV * aPosition;
it will calculate it over and over again, which seems like it would waste time.
So, which way is usually considered standard? Or is there a different way that I am completely missing? As far as I could tell, the shader used to implement the OpenGL ES 1.1 pipeline in the OpenGL ES 2.0 Programming Guide only used one matrix, so is that used more?
First, the correct OpenGL term for "canonical coordinates" is clip space.
Second, it should be this:
gl_Position = uProjection * (uMV * aPosition);
What you posted does a matrix/matrix multiply followed by a matrix/vector multiply. This version does 2 matrix/vector multiplies. That's a substantial difference.
You're using shader-based hardware; how you handle matrices is up to you. There is nothing that is "considered standard"; you do what you best need to do.
That being said, unless you are doing lighting in model space, you will often need some intermediary between model space and 4D homogeneous clip-space. This is the space you transform the positions and normals into in order to compute the light direction, dot(N, L), and so forth.
Personally, I wouldn't suggest world space for reasons that I explain thoroughly here. But whether it's world space, camera space, or something else, you will generally have some intermediate space that you need positions to be in. At which point, the above code becomes necessary, and thus there is no time wasted.