Implementing sparse matrix construction and multiplication in OpenGL ES

I have googled around but haven't found an answer that suits me for OpenGL.
I want to construct a sparse matrix with a single main diagonal and around 9 off-diagonals. These diagonals aren't necessarily next to the main diagonal, and they wrap around. Each diagonal is an image in row-major format, i.e. a vector of size NxM.
The size of the matrix is (NxM)x(NxM).
My question is as follows:
After some messing around with the math I have arrived at the basic units of my operation. It involves a pixel-by-pixel multiplication of two images (WITHOUT limiting the value of the result, i.e. so it can be above 1 or below 0), storing the resulting image, and then adding a bunch of the resulting images (SAME as above).
How can I multiply and add images on a pixel-by-pixel basis in OpenGL? Is it easier in 1.1 or 2.0? Will the use of textures cause hard clamping of the results to between 0 and 1? Will this maximize the use of the GPU cores?

In order to be able to store values outside the [0, 1] range you would have to use floating-point textures. There is no support for them in OpenGL ES 1.1, and in OpenGL ES 2.0 they are an optional extension (see other SO question).
If your implementation supports it, you could then write a fragment program to do the required math.
In OpenGL ES 1.1 you could use the glTexEnv call to set up how the pixels from different texture units are supposed to be combined. You could then use "modulate" or "add" to multiply/add the values. The result would be clamped to the [0, 1] range, though.
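For the ES 2.0 fragment-program route, a minimal sketch of the per-pixel multiply-and-add, assuming a floating-point render target so the sums are not clamped (the texture and varying names here are illustrative, not from the question):
precision highp float;
uniform sampler2D u_DiagonalA;   // one stored diagonal image
uniform sampler2D u_DiagonalB;   // another diagonal image
uniform sampler2D u_VectorA;     // input image sampled at the wrapped offset for A
uniform sampler2D u_VectorB;     // input image sampled at the wrapped offset for B
varying vec2 v_TexCoord;

void main()
{
    // Per-pixel multiply of diagonal and input, then accumulate the partial products.
    float sum = texture2D(u_DiagonalA, v_TexCoord).r * texture2D(u_VectorA, v_TexCoord).r
              + texture2D(u_DiagonalB, v_TexCoord).r * texture2D(u_VectorB, v_TexCoord).r;
    gl_FragColor = vec4(sum, 0.0, 0.0, 1.0);
}
More diagonals would simply add more terms (or more ping-pong passes) to the sum.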

Related

How to sum triangle pixels with OpenGL ES

I am new to OpenGL ES. I am currently reading docs about the 2.0 version of OpenGL ES. I have a triangular 2D mesh, a 2D RGB texture, and I need to compute, for every triangle, the following quantities:
where N is the number of pixels of a given triangle. These quantities are needed for further CPU processing. The idea would be to use GPU rasterization to sum quantities over triangles. I am not able to see how to do this with OpenGL ES 2.0 (which is the most popular version among Android devices). Another question I have is: is it possible to do this type of computation with OpenGL ES 3.0?
I am not able to see how to do this with OpenGL ES 2.0
You can't; the API simply isn't designed to do it.
Is it possible to do this type of computation with OpenGL ES 3.0?
In the general case, no. If you can use OpenGL ES 3.1 and if you can control the input geometry then a viable algorithm would be:
Add a vertex attribute which is the primitive ID for each triangle in the mesh (which we can use as an array index).
Allocate an atomic counter buffer (GL_ATOMIC_COUNTER_BUFFER) with one counter per primitive, pre-zeroed.
In the fragment shader, increment the atomic corresponding to the current primitive (loaded from the vertex attribute), as sketched below.
Performance is likely to be pretty horrible though - atomics generally suck for most GPU implementations.
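A rough sketch of the fragment-shader side, assuming the implementation exposes fragment-stage atomics (ES 3.1 only guarantees them for compute shaders); it is written here with a shader storage buffer and atomicAdd, which is interchangeable with the atomic counter buffer described above. The v_PrimitiveID varying and PerTriangleCounts names are illustrative:
#version 310 es
precision highp float;
precision highp int;

flat in int v_PrimitiveID;          // per-triangle ID fed in as a vertex attribute

// One counter per triangle, zeroed before the draw call.
layout(std430, binding = 0) buffer PerTriangleCounts {
    uint counts[];
};

out vec4 fragColor;

void main()
{
    // Count this fragment against its triangle; the colour output is irrelevant here.
    atomicAdd(counts[v_PrimitiveID], 1u);
    fragColor = vec4(0.0);
}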

Create OpenGL ES texture with two float16 values (beginner)

I need to create a texture for an OpenGL ES 2.0 application with the following specs:
Each pixel has two components (let's call them r and g in the fragment shader).
Each pixel component is a 16 bit float.
That means every pixel in the texture has 4 bytes (2 bytes / 16 bit for each component).
The fragment shader should be able to sample the texture as 2 float16 components.
All formats must be supported on OpenGL ES 2.0 and be as efficient as possible.
How would the appropriate glTexImage2D call look?
Regards
Neither floating point textures nor floating point render targets are supported in OpenGL ES 2.x. The short answer is therefore "you can't do what you are trying to do", at least not natively.
You can emulate higher precision by packing pairs of values into a RGBA8 texture or render target, e.g. the pair of RG values is one value, and BA is the other, but you'll have to pack/unpack the component 8-bit unorms yourself in shader code. This is quite a common solution in deferred rendering G-buffers for example, but can be relatively expensive on some of the lower-end mobile GPU parts (given it's basically just overhead, rather than useful rendering).
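A rough sketch of the shader-side pack/unpack for that approach, assuming values in [0, 1] and the usual high-byte/low-byte split (the second value goes through the same functions but is stored in the B/A channels):
// Pack a value in [0, 1] across two 8-bit unorm channels (high byte, low byte).
vec2 pack16(float v)
{
    float scaled = v * 255.0;
    return vec2(floor(scaled) / 255.0, fract(scaled));
}

// Recover the value from the two channels after sampling the RGBA8 texture.
float unpack16(vec2 p)
{
    return p.x + p.y / 255.0;
}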

Summed area table in GLSL and GPU fragment shader execution

I am trying to compute the integral image (aka summed area table) of a texture I have in the GPU memory (a camera capture), the goal being to compute the adaptive threshold of said image. I'm using OpenGL ES 2.0, and still learning :).
I did a test with a simple Gaussian blur shader (vertical/horizontal pass), which is working fine, but I need a much bigger variable average area for it to give satisfactory results.
I did implement a version of that algorithm on CPU before, but I'm a bit confused on how to implement that on a GPU.
I tried to do a (completely incorrect) test with just something like this for every fragment:
#version 100
#extension GL_OES_EGL_image_external : require
precision highp float;
uniform sampler2D u_Texture; // The input texture.
varying lowp vec2 v_TexCoordinate; // Interpolated texture coordinate per fragment.
uniform vec2 u_PixelDelta; // Pixel delta
void main()
{
    // get neighboring pixels values
    float center = texture2D(u_Texture, v_TexCoordinate).r;
    float a = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, 0.0)).r;
    float b = texture2D(u_Texture, v_TexCoordinate + vec2(0.0, u_PixelDelta.y * 1.0)).r;
    float c = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, u_PixelDelta.y * 1.0)).r;

    // compute value
    float pixValue = center + a + b - c;

    // Result stores value (R) and original gray value (G)
    gl_FragColor = vec4(pixValue, center, center, 1.0);
}
And then another shader to get the area that I want and then get the average. This is obviously wrong as there are multiple execution units operating at the same time.
I know that the common way of computing a prefix sum on a GPU is to do it in two passes (vertical/horizontal, as discussed here on this thread or here), but isn't there a problem here, as there is a data dependency of each cell on the previous (top or left) one?
I can't seem to understand the order in which the multiple execution units on a GPU will process the different fragments, and how a two-pass filter can solve that issue. As an example, if I have some values like this:
2 1 5
0 3 2
4 4 7
The two passes should give (first columns, then rows):
2 1  5      2  3  8
2 4  7  ->  2  6 13
6 8 14      6 14 28
How can I be sure that, as an example, the value [0;2] will be computed as 6 (2 + 4) and not 4 (0 + 4, if the 0 hasn't been computed yet)?
Also, since as I understand it fragments are not pixels (if I'm not mistaken), would the values I store back into one of my textures in the first pass be the same in another pass if I use the exact same coordinates passed from the vertex shader, or will they be interpolated in some way?
Tommy and Bartvbl address your questions about a summed-area table, but your core problem of an adaptive threshold may not need that.
As part of my open source GPUImage framework, I've done some experimentation with optimizing blurs over large radii using OpenGL ES. Generally, increasing blur radii leads to a significant increase in texture sampling and calculations per pixel, with an accompanying slowdown.
However, I found that for most blur operations you can apply a surprisingly effective optimization to cap the number of blur samples. If you downsample the image before blurring, blur at a smaller pixel radius (radius / downsampling factor), and then linearly upsample, you can arrive at a blurred image that is the equivalent of one blurred at a much larger pixel radius. In my tests, these downsampled, blurred, and then upsampled images look almost identical to the ones blurred based on the original image resolution. In fact, precision limits can lead to larger-radii blurs done at a native resolution breaking down in image quality past a certain size, where the downsampled ones maintain the proper image quality.
By adjusting the downsampling factor to keep the downsampled blur radius constant, you can achieve near constant-time blurring speeds in the face of increasing blur radii. For an adaptive threshold, the image quality should be good enough to use for your comparisons.
I use this approach in the Gaussian and box blurs within the latest version of the above-linked framework, so if you're running on Mac, iOS, or Linux, you can evaluate the results by trying out one of the sample applications. I have an adaptive threshold operation based on a box blur that uses this optimization, so you can see if the results there are what you want.
As per the above, it's not going to be fantastic on a GPU. But assuming the cost of shunting data between the GPU and CPU is more troubling, it may still be worth persevering.
The most obvious prima facie solution is to split horizontal/vertical as discussed. Use an additive blending mode, create a quad that draws the whole source image, then e.g. for the horizontal step on a bitmap of width n issue a call that requests the quad be drawn n times, the 0th time at x = 0, the mth time at x = m. Then ping-pong via an FBO, switching the target buffer of the horizontal draw into the source texture for the vertical.
Memory accesses are probably O(n^2) (i.e. you'll probably cache quite well, but that's hardly a complete relief), so it's a fairly poor solution. You could improve it with divide and conquer, by doing the same thing in bands — e.g. for the vertical step, independently sum individual rows of 8, after which the error in every row below the final one is the failure to include whatever the sums are on that row. So perform a second pass to propagate those.
However an issue with accumulating in the frame buffer is clamping to avoid overflow — if you're expecting a value greater than 255 anywhere in the integral image then you're out of luck because the additive blending will clamp and GL_RG32I et al don't reach ES prior to 3.0.
The best solution I can think of to that, without using any vendor-specific extensions, is to split up the bits of your source image and combine channels after the fact. Supposing your source image were 4 bit and your image less than 256 pixels in both directions, you'd put one bit each in the R, G, B and A channels, perform the normal additive step, then run a quick recombine shader as value = A + (B*2) + (G*4) + (R*8). If your texture is larger or smaller in size or bit depth then scale up or down accordingly.
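A rough sketch of that recombine pass, assuming the per-channel bit-plane sums survived the additive accumulation without clamping (u_Accumulated is an illustrative name for the accumulated texture):
precision highp float;
uniform sampler2D u_Accumulated;   // result of the additive passes, one bit plane per channel
varying vec2 v_TexCoordinate;

void main()
{
    // Undo the unorm storage to get per-channel counts, then weight the bit planes.
    vec4 counts = texture2D(u_Accumulated, v_TexCoordinate) * 255.0;
    float value = counts.a + counts.b * 2.0 + counts.g * 4.0 + counts.r * 8.0;
    gl_FragColor = vec4(value / 255.0);   // rescale however suits the next stage
}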
(platform specific observation: if you're on iOS then you've hopefully already got a CVOpenGLESTextureCache in the loop, which means you have CPU and GPU access to the same texture store, so you might well prefer to kick this step off to GCD. iOS is amongst the platforms supporting EXT_shader_framebuffer_fetch; if you have access to that then you can write any old blend function you like and at least ditch the combination step. Also you're guaranteed that preceding geometry has completed before you draw so if each strip writes its totals where it should and also to the line below then you can perform the ideal two-pixel-strips solution with no intermediate buffers or state changes)
What you attempt to do cannot be done in a fragment shader. GPUs are by nature very different from CPUs, executing their instructions in parallel, in massive numbers at the same time. Because of this, OpenGL does not make any guarantees about execution order, because the hardware physically doesn't allow it to.
So there is not really any defined order other than "whatever the GPU thread block scheduler decides".
Fragments are pixels, sorta-kinda. They are pixels that potentially end up on screen. If one triangle ends up in front of another, the previously calculated colour value is discarded. This happens regardless of whatever colour was stored at that pixel in the colour buffer previously.
As for creating the summed area table on the GPU, I think you may first want to look at GLSL "Compute Shaders", which are specifically made for this sort of thing.
I think you may be able to get this to work by creating a single thread for each row of pixels in the table, then have every thread "lag behind" by 1 pixel compared to the previous row.
In pseudocode:
int row_id = thread_id()
for column_index in (image.cols + image.rows):
    int my_current_column_id = column_index - row_id
    if my_current_column_id >= 0 and my_current_column_id < image.width:
        // calculate sums
The catch of this method is that all threads should be guaranteed to execute their instructions simultaneously without getting ahead of one another. This is guaranteed in CUDA, but I'm not sure whether it is in OpenGL compute shaders. It may be a starting point for you, though.
It may look surprising to a beginner, but the prefix sum or SAT calculation is suitable for parallelization. While the Hensley algorithm is the most intuitive to understand (and is also implemented in OpenGL), more work-efficient parallel methods are available, see CUDA scan. The paper from Sengupta discusses a parallel method with reduce and down-sweep phases, which seems to be the state-of-the-art efficient method. These are valuable materials, but they do not go into OpenGL shader implementations in detail. The closest document is the presentation you have found (it refers to the Hensley publication), since it has some shader snippets. This job is doable entirely in a fragment shader with FBO ping-pong. Note that the FBO and its texture need to have their internal format set to high precision - GL_RGB32F would be best, but I am not sure if it is supported in OpenGL ES 2.0.
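For reference, a minimal sketch of one horizontal pass of that ping-pong prefix sum, reusing the uniform/varying names from the question; u_Offset is an added uniform holding the pass offset of 1, 2, 4, ... pixels, doubled every pass, and a second, analogous shader does the vertical passes:
precision highp float;
uniform sampler2D u_Texture;       // result of the previous pass
uniform vec2 u_PixelDelta;         // 1.0 / texture size
uniform float u_Offset;            // 1.0, 2.0, 4.0, ... (in pixels)
varying vec2 v_TexCoordinate;

void main()
{
    // Each pixel adds the value u_Offset pixels to its left, if there is one.
    vec4 sum = texture2D(u_Texture, v_TexCoordinate);
    vec2 left = v_TexCoordinate - vec2(u_Offset * u_PixelDelta.x, 0.0);
    if (left.x >= 0.0) {
        sum += texture2D(u_Texture, left);
    }
    gl_FragColor = sum;
}
After log2(width) horizontal passes and log2(height) vertical passes, each texel holds the summed-area value for its position.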

Fast approximate algorithm for RGB/LAB conversion?

I am working on a data visualization tool using OpenGL, and the LAB color space is the most comprehensible color space for visualization of the data I'm dealing with (3 axes of data are mapped to the 3 axes of the color space). Is there a fast (e.g. no non-integer exponentiation, suitable for execution in a shader) algorithm for approximate conversion of LAB values to and from RGB values?
If doing the actual conversion calculation in a shader is too complex/expensive, you can always use a lookup table. Since both color spaces have 3 components, you can use a 3D RGB texture to represent the lookup table.
Using a 3D texture might sound like a lot of overhead. Since 8 bits/component is often used to represent colors in OpenGL, you would need a 256x256x256 3D texture. At 4 bytes/texel, that's a 64 MByte texture, which is not outrageous, but very substantial.
However, depending on how smooth the values in the translation table are, you might be able to get away with a lower resolution. Keep in mind that texture sampling uses linear interpolation. If piecewise linear interpolation is good enough with a certain base-resolution of the lookup table, you can greatly reduce the size.
If you go this direction, and can't afford to use 64 MBytes for the LUT, you'll have to play with the size of the LUT, and make a possible size/performance vs. quality tradeoff.
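A minimal sketch of the lookup itself, assuming desktop GL (or ES 3.0) for sampler3D support; the sampler and varying names are illustrative, and the LAB components are first normalised into the LUT's [0, 1] coordinate range:
uniform sampler3D u_LabToRgbLut;   // precomputed LAB -> RGB table
varying vec3 v_Lab;                // L in [0, 100], a and b roughly in [-128, 127]

void main()
{
    vec3 coord = vec3(v_Lab.x / 100.0,
                      (v_Lab.y + 128.0) / 255.0,
                      (v_Lab.z + 128.0) / 255.0);
    gl_FragColor = vec4(texture3D(u_LabToRgbLut, coord).rgb, 1.0);
}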

Storing SRT values for skeletal animation

I'm currently developing a skeletal animation system for my current project. I think I understand how it works after doing a lot of reading, but while looking at the source of a few different projects, I was wondering why the scale, rotation and translation values of each bone are stored separately. You could store all of them within a single matrix, right? Wouldn't that be more efficient, or would you need more math to get the separate values back, making it less efficient?
Also, while I'm at it, apparently there are 2 common ways of storing the values: as matrices, or as vectors and quaternions. The latter is used more frequently to avoid gimbal lock. However, my project has only 2 degrees of freedom. What would be the most efficient way to store my values?
No, a matrix wouldn't be more efficient because you would have to use a 4x4 matrix, so 16 floats. This is mainly because of the way rotation + translation are stored in the matrix.
If you store the values in SRT form you end up with 9 floats, as the rotation quaternion's w component can be re-computed from the others on load.
Moreover, many game engines do not support non-uniform scaling, so you can shrink the scale to 1 float and end up with 8 floats per bone!
And that's before compression: since you know that bones won't move past a certain point (they are volume-bound), there is no point allocating precision to ranges they will never reach, so you can encode those floats down to, say, 16 bits, ending up with the equivalent of 4 floats per bone.
Now to be fair, I've never implemented that last part with compression; it seemed a bit extreme to me and I didn't have the time.
But going from 64 bytes per bone to 32 bytes per bone is a 50% saving!
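A small sketch of the w reconstruction mentioned above, written as a GLSL helper in case the compressed bones are consumed directly on the GPU (the same math applies on the CPU at load time), and assuming the exporter keeps w non-negative by flipping the quaternion's sign when needed:
// A unit quaternion satisfies x^2 + y^2 + z^2 + w^2 = 1, so w never needs storing.
vec4 rebuildQuaternion(vec3 xyz)
{
    return vec4(xyz, sqrt(max(0.0, 1.0 - dot(xyz, xyz))));
}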

Resources