OpenGL Textures Format Restrictions - opengl-es

Are there certain format restrictions that textures need to adhere to?
I am loading TGA files and drawing them with the following fragment shader:
varying vec2 v_texCoord;
uniform sampler2D s_texture;
uniform vec4 vColor4;
void main()
{
    vec4 tmpColor = texture2D( s_texture, v_texCoord );
    tmpColor.r = vColor4.r;
    tmpColor.g = vColor4.g;
    tmpColor.b = vColor4.b;
    gl_FragColor = tmpColor;
}
I find that 16x16 images display OK, and 64x16 displays OK, but 72x16, 80x16 and 96x16 don't work.
I will provide more information including the TGA files if needed.

72, 80 and 96 are not powers of two; in OpenGL ES this restriction has little to do with the data format. The restriction is actually still pervasive even in modern desktop GL, where whether it applies can depend on the data format used.
Uncompressed texture data in (desktop) OpenGL 2.0 or greater can have non-power-of-two dimensions.
However, compressed texture data continues to require block sizes that are multiples of 4, pixel transfer functions continue to assume 4-byte data alignment for each row in an image, floating-point textures, if supported, may also require power-of-two dimensions, and so on.
Many image libraries designed for GL will actually rescale images to a power of two, which solves every one of the problems discussed above. It is not always the most appropriate fix (it can be extremely wasteful), but it can be applied universally to just about any common dimension problem.

Related

Optimization ideas: apply a LUT (lookup table) on an image

I'm currently working on a project that uses LUTs to modify the colors of images.
My problem is that my program is not optimized...
What my program does:
* Opens a LUT file (.cube) and stores the values in the memory
* On each pixel of the image, trilinear interpolation is used to change the colors using the LUT
What I've tried:
* Downscaling the image, but the process still takes too much time...
How can programs such as Premiere Pro or DaVinci Resolve apply a LUT to footage and play it back at 24 fps? My program takes 10 s to apply a LUT to a JPG/DNG file!
The most efficient way to do this would be on the GPU, which can run many simple interpolation and lookup operations on many pixels simultaneously.
This article: https://developer.nvidia.com/gpugems/GPUGems2/gpugems2_chapter24.html
describes the algorithm, and it's simple enough that porting it to OpenGL or another GPU shading language is trivial:
void main(in float2 sUV : TEXCOORD0,
          out half4 cOut : COLOR0,
          const uniform samplerRECT imagePlane,
          const uniform sampler3D lut,
          const uniform float3 lutSize)
{
    // get raw RGB pixel values
    half3 rawColor = texRECT(imagePlane, sUV).rgb;
    // calculate scale and offset values
    half3 scale = (lutSize - 1.0) / lutSize;
    half3 offset = 1.0 / (2.0 * lutSize);
    // apply the LUT
    cOut.rgb = tex3D(lut, scale * rawColor + offset);
}
Outside of that, you will have to upload the LUT to the GPU (as a 3D texture bound to the lut sampler) from your application code, and then stream every video frame into the GPU so it can be passed through your fragment shader in a render/work loop. This is most likely what professional video editing programs do in order to apply LUTs under realtime video constraints.
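For reference, the Cg kernel above ports almost directly to GLSL. Below is a minimal GLSL ES 3.0 sketch, assuming the .cube data has been uploaded as a 3D texture with GL_LINEAR filtering (sampler3D requires ES 3.0, or ES 2.0 with the OES_texture_3D extension), and with illustrative names (u_image, u_lut, u_lutSize, v_texCoord) that are not part of the original code:
#version 300 es
precision highp float;
precision highp sampler3D;

in vec2 v_texCoord;           // interpolated texture coordinate
out vec4 fragColor;

uniform sampler2D u_image;    // current video frame or photo
uniform sampler3D u_lut;      // 3D LUT built from the .cube file
uniform float u_lutSize;      // e.g. 33.0 for a 33x33x33 LUT

void main()
{
    // raw RGB pixel value
    vec3 rawColor = texture(u_image, v_texCoord).rgb;

    // scale/offset so that 0.0 and 1.0 hit the centers of the first
    // and last LUT cells rather than their edges
    float scale  = (u_lutSize - 1.0) / u_lutSize;
    float offset = 0.5 / u_lutSize;

    // the texture unit performs the trilinear interpolation
    fragColor = vec4(texture(u_lut, rawColor * scale + offset).rgb, 1.0);
}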
P.S. harold's comment about precalculating the lookup entries is also a valid way to speed the process up, turning the operation into a pure memory access per lookup. It will still probably be orders of magnitude less efficient than GPU processing, because of how much slower CPU memory access is compared to what the GPU does, and it can be very memory-inefficient, depending on the system you run it on and the dimensionality and size of your LUT.
For example, let's say you want to build the 'full' 3D LUT for 24-bit RGB. That means your cube needs an edge of 256 entries, so the final size is 256^3 * 3 (RGB) * 2 bytes (half-float), for a total of roughly 100MB. If it is just a 1D LUT, or you use a lower color bit depth, this might not be an issue, but the method is still inefficient compared to letting the GPU handle the interpolation for you.

Summed area table in GLSL and GPU fragment shader execution

I am trying to compute the integral image (aka summed area table) of a texture I have in the GPU memory (a camera capture), the goal being to compute the adaptive threshold of said image. I'm using OpenGL ES 2.0, and still learning :).
I did a test with a simple Gaussian blur shader (vertical/horizontal pass), which is working fine, but I need a much bigger variable averaging area for it to give satisfactory results.
I implemented a version of that algorithm on the CPU before, but I'm a bit confused about how to implement it on a GPU.
I tried to do a (completely incorrect) test with just something like this for every fragment:
#version 100
#extension GL_OES_EGL_image_external : require
precision highp float;
uniform sampler2D u_Texture; // The input texture.
varying lowp vec2 v_TexCoordinate; // Interpolated texture coordinate per fragment.
uniform vec2 u_PixelDelta; // Pixel delta
void main()
{
    // get neighboring pixels values
    float center = texture2D(u_Texture, v_TexCoordinate).r;
    float a = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, 0.0)).r;
    float b = texture2D(u_Texture, v_TexCoordinate + vec2(0.0, u_PixelDelta.y * 1.0)).r;
    float c = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, u_PixelDelta.y * 1.0)).r;
    // compute value
    float pixValue = center + a + b - c;
    // Result stores value (R) and original gray value (G)
    gl_FragColor = vec4(pixValue, center, center, 1.0);
}
And then another shader to get the area that I want and then get the average. This is obviously wrong, as there are multiple execution units operating at the same time.
I know that the common way of computing a prefix sum on a GPU is to do it in two passes (vertical/horizontal, as discussed here on this thread or here), but isn't there a problem here, since there is a data dependency of each cell on the previous (top or left) one?
I can't seem to understand the order in which the multiple execution units on a GPU will process the different fragments, and how a two-pass filter can solve that issue. As an example, if I have some values like this:
2 1 5
0 3 2
4 4 7
The two pass should give (first columns then rows):
2 1  5       2  3  8
2 4  7   ->  2  6 13
6 8 14       6 14 28
How can I be sure that, for example, the value [0;2] will be computed as 6 (2 + 4) and not 4 (0 + 4, if the 0 hasn't been computed yet)?
Also, as I understand that fragments are not pixels (if I'm not mistaken), will the values I store back into one of my textures in the first pass be the same in another pass if I use the exact same coordinates passed from the vertex shader, or will they be interpolated in some way?
Tommy and Bartvbl address your questions about a summed-area table, but your core problem of an adaptive threshold may not need that.
As part of my open source GPUImage framework, I've done some experimentation with optimizing blurs over large radii using OpenGL ES. Generally, increasing blur radii leads to a significant increase in texture sampling and calculations per pixel, with an accompanying slowdown.
However, I found that for most blur operations you can apply a surprisingly effective optimization to cap the number of blur samples. If you downsample the image before blurring, blur at a smaller pixel radius (radius / downsampling factor), and then linearly upsample, you can arrive at a blurred image that is the equivalent of one blurred at a much larger pixel radius. In my tests, these downsampled, blurred, and then upsampled images look almost identical to the ones blurred based on the original image resolution. In fact, precision limits can lead to larger-radii blurs done at a native resolution breaking down in image quality past a certain size, where the downsampled ones maintain the proper image quality.
By adjusting the downsampling factor to keep the downsampled blur radius constant, you can achieve near constant-time blurring speeds in the face of increasing blur radii. For an adaptive threshold, the image quality should be good enough to use for your comparisons.
I use this approach in the Gaussian and box blurs within the latest version of the above-linked framework, so if you're running on Mac, iOS, or Linux, you can evaluate the results by trying out one of the sample applications. I have an adaptive threshold operation based on a box blur that uses this optimization, so you can see if the results there are what you want.
As per the above, it's not going to be fantastic on a GPU. But assuming the cost of shunting data between the GPU and CPU is more troubling, it may still be worth persevering.
The most obvious prima facie solution is to split horizontal/vertical as discussed. Use an additive blending mode, create a quad that draws the whole source image, then, e.g. for the horizontal step on a bitmap of width n, issue a call that requests the quad be drawn n times, the 0th time at x = 0, the mth time at x = m. Then ping-pong via an FBO, switching the target buffer of the horizontal draw into the source texture for the vertical.
Memory accesses are probably O(n^2) (i.e. you'll probably cache quite well, but that's hardly a complete relief), so it's a fairly poor solution. You could improve it with divide and conquer, by doing the same thing in bands, e.g. for the vertical step, independently sum individual rows of 8, after which the error in every row below the final one is the failure to include whatever the sums are on that row. So perform a second pass to propagate those.
However an issue with accumulating in the frame buffer is clamping to avoid overflow — if you're expecting a value greater than 255 anywhere in the integral image then you're out of luck because the additive blending will clamp and GL_RG32I et al don't reach ES prior to 3.0.
The best solution I can think of to that, without using any vendor-specific extensions, is to split up the bits of your source image and combine channels after the fact. Supposing your source image were 4 bit and your image less than 256 pixels in both directions, you'd put one bit each in the R, G, B and A channels, perform the normal additive step, then run a quick recombine shader as value = A + (B*2) + (G*4) + (R*8). If your texture is larger or smaller in size or bit depth then scale up or down accordingly.
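As an illustrative sketch of that recombine pass (the texture and variable names here are invented; in practice the recombined value can exceed 1.0, so you would render it to a float target or scale it back into range):
precision highp float;
varying vec2 v_texCoord;
uniform sampler2D u_summedBits;   // additively blended bit planes (R, G, B, A)

void main()
{
    vec4 sums = texture2D(u_summedBits, v_texCoord);
    // value = A + B*2 + G*4 + R*8, as described above (sums is ordered R, G, B, A)
    float value = dot(sums, vec4(8.0, 4.0, 2.0, 1.0));
    gl_FragColor = vec4(vec3(value), 1.0);
}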
(platform specific observation: if you're on iOS then you've hopefully already got a CVOpenGLESTextureCache in the loop, which means you have CPU and GPU access to the same texture store, so you might well prefer to kick this step off to GCD. iOS is amongst the platforms supporting EXT_shader_framebuffer_fetch; if you have access to that then you can write any old blend function you like and at least ditch the combination step. Also you're guaranteed that preceding geometry has completed before you draw so if each strip writes its totals where it should and also to the line below then you can perform the ideal two-pixel-strips solution with no intermediate buffers or state changes)
What you attempt to do cannot be done in a fragment shader. GPUs are by nature very different from CPUs: they execute their instructions in parallel, in massive numbers, at the same time. Because of this, OpenGL does not make any guarantees about execution order, because the hardware physically doesn't allow it to.
So there is not really any defined order other than "whatever the GPU thread block scheduler decides".
Fragments are pixels, sorta-kinda. They are pixels that potentially end up on screen. If another triangle ends up in front, the previously calculated colour value is discarded. This happens regardless of whatever colour was stored at that pixel in the colour buffer previously.
As for creating the summed area table on the GPU, I think you may first want to look at GLSL "Compute Shaders", which are specifically made for this sort of thing.
I think you may be able to get this to work by creating a single thread for each row of pixels in the table, then have every thread "lag behind" by 1 pixel compared to the previous row.
In pseudocode:
int row_id = thread_id()
for column_index in range(image.width + image.height):
    int my_current_column_id = column_index - row_id
    if my_current_column_id >= 0 and my_current_column_id < image.width:
        // calculate sums
The catch of this method is that all threads should be guaranteed to execute their instructions simultaneously without getting ahead of one another. This is guaranteed in CUDA, but I'm not sure whether it is in OpenGL compute shaders. It may be a starting point for you, though.
It may look surprising to a beginner, but the prefix sum or SAT calculation is well suited to parallelization. While the Hensley algorithm is the most intuitive to understand (and is also implemented in OpenGL), more work-efficient parallel methods are available, see CUDA scan. The paper from Sengupta discusses a parallel method with reduce and down-sweep phases, which seems to be the state-of-the-art efficient approach. These are valuable materials, but they do not go into OpenGL shader implementations in detail. The closest document is the presentation you have found (it refers to the Hensley publication), since it has some shader snippets. This is a job that can be done entirely in a fragment shader with FBO ping-pong. Note that the FBO and its texture need to have a high-precision internal format; GL_RGB32F would be best, but I am not sure whether it is supported in OpenGL ES 2.0.
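To give a more concrete idea, one horizontal pass of a Hensley-style recursive-doubling sum could look roughly like the sketch below (uniform names are illustrative). The pass is run log2(width) times, doubling the offset each time and ping-ponging between two FBOs, and the whole procedure is then repeated vertically:
precision highp float;
varying vec2 v_texCoord;
uniform sampler2D u_src;   // result of the previous pass (FBO ping-pong)
uniform vec2 u_offset;     // (2^pass / width, 0.0) for horizontal passes,
                           // (0.0, 2^pass / height) for vertical passes

void main()
{
    vec4 sum = texture2D(u_src, v_texCoord);
    vec2 neighbour = v_texCoord - u_offset;
    // only add the neighbour if it lies inside the image
    if (neighbour.x >= 0.0 && neighbour.y >= 0.0) {
        sum += texture2D(u_src, neighbour);
    }
    gl_FragColor = sum;
}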

Luminance 'y' value of surface texture

For my OpenGL video player app, I am using a SurfaceTexture bound to GL_TEXTURE_EXTERNAL_OES.
Source: https://github.com/crossle/MediaPlayerSurface/blob/master/src/me/crossle/demo/surfacetexture/VideoSurfaceView.java
In my fragment shader, I want the luminance value to be taken for a 3x3 block.
vec2 tex00 = vec2(vTextureCoord.x-xmargin, vTextureCoord.y-ymargin)
vec4 p00 = texture2D(sTexture, tex00)
... etc for 3x3
and then calculate the luminance of each texel (i.e. p00) by taking the dot product of p00.rgb with vec3(0.3, 0.59, 0.11).
Instead, is it possible to directly use p00.y? Will it give the luminance value?
No, p00.y is the same as p00.g (or p00.t). They are different ways to access the second component (green channel) of your vector. You can access components as XYZW, RGBA, or STPQ, and there is no difference between them.
The only reason that people use .rgb instead of .xyz is to make it easier for humans to read.
No, but it may be close enough for your use case. By just using y you will miss some cases, though; a pure red image would give 0. You can, however, add up all 9 samples and do one dot product on the result, as in the sketch below.
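A minimal sketch of that suggestion, assuming the remaining samples p01 ... p22 are fetched the same way as p00 in the question and using the same weights:
// sum the 3x3 block first, then take a single dot product for the average luminance
vec3 sum = p00.rgb + p01.rgb + p02.rgb
         + p10.rgb + p11.rgb + p12.rgb
         + p20.rgb + p21.rgb + p22.rgb;
float avgLuma = dot(sum / 9.0, vec3(0.3, 0.59, 0.11));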

How can I debug a GLSL shader?

How can I debug an OpenGL shader? For example.:
void main(void)
{
vec2 uv = gl_FragCoord.xy;
gl_FragColor = vec4(uv,222,1);
}
Is there a way that I can find out what the uv value is?
Debugging a GLSL Shader
There are many ways to debug a shader; most of them are visual, rather than outputting the full pixel data of the whole image.
Since shaders run in a highly parallel way, you can output a lot of visual data at once.
There are also external applications that can help you a lot in debugging GLSL shaders and the whole rendering pipeline.
visual debugging
This is the simplest form of debugging, though the results can be the hardest to interpret. You just output the data you want to inspect to the screen, for example the uv values you asked about.
You could do it like this:
void mainImage( out vec4 fragColor, in vec2 fragCoord )
{
    // Normalized pixel coordinates (from 0 to 1)
    vec2 uv = fragCoord/iResolution.xy;
    // Output to screen
    fragColor = vec4(uv, 0.0, 1.0);
}
The output shows that the normalized uv covers a range from 0 to 1: yellow means uv is vec2(1.0, 1.0) and black means vec2(0.0, 0.0).
Another example of visual debugging (source: https://www.shadertoy.com/view/ts2yzK):
This one is from a raymarching project. There are two things going on in this image: I debug the depth of the ray and also whether there is a hit or not.
The image is good for seeing what you have hit or not, since white means hit and black means miss.
On the other hand, it is bad at displaying the depth of the image. You can have a hard time debugging images like this because of two things:
negative values, which are rendered as black
values greater than 1.0, which are rendered as white
One workaround is to remap the value into the displayable [0, 1] range before writing it out, as in the sketch below.
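A small sketch of such a remap (u_maxDepth is an assumed, scene-dependent constant, not part of the original shader):
// remap an out-of-range debug value (e.g. ray depth) into [0, 1] before displaying it
float t = clamp(depth / u_maxDepth, 0.0, 1.0);
fragColor = vec4(vec3(t), 1.0);
// for signed values, 0.5 + 0.5 * value makes negative values visible as well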
Values like these lead us to the other type of debugging.
external application
https://renderdoc.org/
RenderDoc is one of the best debuggers out there. Yes, there are many others, but this one is free to use and has a very wide range of uses.
You can debug a GLSL vertex shader by inspecting the vertex data before and after the shader runs.
You can see every pixel written by a fragment shader and its value.
You can also see how your data is stored on the GPU, and many more things.
note for the end
While debugging, visual or not, you have to know what you are looking for, or at least have a hint. Visual debugging is one of the hardest ways to debug, since there is usually no way to step through your code because of the high parallelism of shaders.
Many bugs also come from forgetting small things, like normalizing a value or a missing minus sign.
To spot errors and debug properly you have to be thorough and patient.

When using the same Vertex shader in different programs, does the uniform location persist

Sorry if this is a duplicate; I can't seem to find a solid answer.
If I use the same vertex shader in multiple programs, is it safe to assume the getUniformLocation result will stay the same?
For example, if I use the following vertex shader in multiple programs (A, B, C, D):
uniform mat4 uMvp;
attribute vec3 aPosition;
void main() {
    vec4 position = vec4(aPosition.xyz, 1.);
    gl_Position = uMvp * position;
}
and at initialization I were to call
GLuint mvpLoc = getUniformLocation("uMvp");
while using program A, would I safely be able to switch to program B/C/D and carry on using mvpLoc? I am relatively new to GLES 2.0 and on the face of it this seems like bad practice, but I assume there is overhead in calling getUniformLocation that would be good to avoid.
I have read about glBindAttribLocation, so I could use that and instead make uMvp an attribute, but then I feel like I am missing a point, as the common practice seems to be passing MVPs as uniform variables.
No, each program object will have separate uniform locations. There's no way to guarantee that two different programs use the same location for the same uniform.
Unless you have access to ARB_explicit_uniform_location or GL 4.3 (which you don't, since you're using ES 2.0). This allows you to explicitly specify uniform locations in the shader.
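To illustrate, this is what explicit locations look like; a sketch that requires desktop GL 4.3 (or the ARB_explicit_uniform_location extension) or OpenGL ES 3.1, and is not available in ES 2.0:
#version 310 es
// pins uMvp to location 0 in every program that links this shader
layout(location = 0) uniform mat4 uMvp;
layout(location = 0) in vec3 aPosition;

void main() {
    gl_Position = uMvp * vec4(aPosition, 1.0);
}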
