glGetQueryObjectuiv, "Bound query buffer is not large enough to store result." - performance

I am trying to solve an error I get when I run this sample.
It concerns occlusion queries: essentially it renders a square four times, changing the viewport each time, but only the middle two passes actually render anything, since the first and the last viewports are deliberately placed outside the monitor area.
viewports[0] = new Vec4(windowSize.x * -0.5f, windowSize.y * -0.5f, windowSize.x * 0.5f, windowSize.y * 0.5f);
viewports[1] = new Vec4(0, 0, windowSize.x * 0.5f, windowSize.y * 0.5f);
viewports[2] = new Vec4(windowSize.x * 0.5f, windowSize.y * 0.5f, windowSize.x * 0.5f, windowSize.y * 0.5f);
viewports[3] = new Vec4(windowSize.x * 1.0f, windowSize.y * 1.0f, windowSize.x * 0.5f, windowSize.y * 0.5f);
Each time, it begins a different query with glBeginQuery, renders once, and then ends the GL_ANY_SAMPLES_PASSED query:
// Samples count query
for (int i = 0; i < viewports.length; ++i) {
    gl4.glViewportArrayv(0, 1, viewports[i].toFA_(), 0);
    gl4.glBeginQuery(GL_ANY_SAMPLES_PASSED, queryName.get(i));
    {
        gl4.glDrawArraysInstanced(GL_TRIANGLES, 0, vertexCount, 1);
    }
    gl4.glEndQuery(GL_ANY_SAMPLES_PASSED);
}
Then I try to read the result
gl4.glBindBuffer(GL_QUERY_BUFFER, bufferName.get(Buffer.QUERY));
IntBuffer params = GLBuffers.newDirectIntBuffer(1);
for (int i = 0; i < viewports.length; ++i) {
    params.put(0, i);
    gl4.glGetQueryObjectuiv(queryName.get(i), GL_QUERY_RESULT, params);
}
But I get:
GlDebugOutput.messageSent(): GLDebugEvent[ id 0x502
type Error
severity High: dangerous undefined behavior
source GL API
msg GL_INVALID_OPERATION error generated. Bound query buffer is not large enough to store result.
when 1455696348371
source 4.5 (Core profile, arb, debug, compat[ES2, ES3, ES31, ES32], FBO, hardware) - 4.5.0 NVIDIA 356.39 - hash 0x238337ea]
If I look at the API doc, it says:
params
If a buffer is bound to the GL_QUERY_RESULT_BUFFER target, then params is treated as an offset to a location within that buffer's data store to receive the result of the query. If no buffer is bound to GL_QUERY_RESULT_BUFFER, then params is treated as an address in client memory of a variable to receive the resulting data.
I guess there is an error in that phrase: I think they meant GL_QUERY_BUFFER instead of GL_QUERY_RESULT_BUFFER; indeed they also use GL_QUERY_BUFFER here, for example.
Anyway, if anything is bound there, then params is interpreted as an offset. OK,
but my buffer is big enough:
gl4.glBindBuffer(GL_QUERY_BUFFER, bufferName.get(Buffer.QUERY));
gl4.glBufferData(GL_QUERY_BUFFER, Integer.BYTES * queryName.capacity(), null, GL_DYNAMIC_COPY);
gl4.glBindBuffer(GL_QUERY_BUFFER, 0);
So what's the problem?
I tried passing a much bigger number, such as 500, for the buffer size, but no success.
I guess the error lies somewhere else... can you see it?

If I had to answer, I'd say I expect that if I bind a buffer to the GL_QUERY_BUFFER target, then OpenGL should read the value inside params and interpret it as the offset (in bytes) at which it should save the result of the query.
No, that's not how it works.
In C/C++, the value taken by glGetQueryObject is a pointer, which normally is a pointer to a client memory buffer. For this particular function, this would often be a stack variable:
GLuint val;
glGetQueryObjectuiv(obj, GL_QUERY_RESULT, &val);
val is declared by client code (i.e., the code calling into OpenGL). This code passes a pointer to that variable, and glGetQueryObjectuiv will write data through this pointer.
This is emulated in C# bindings by using *Buffer types. These represent contiguous arrays of values from which C# can extract a pointer that is compatible with C and C++ pointers-to-arrays.
However, when a buffer is bound to GL_QUERY_BUFFER, the meaning of the parameter changes. As you noted, it goes from being a client pointer to memory to being an offset. But please note what that says. It does not say a "client pointer to an offset".
That is, the pointer value itself ceases being a pointer to actual memory. Instead, the numerical value of the pointer is treated as an offset.
In C++ terms, that's this:
glBindBuffer(GL_QUERY_BUFFER, buff);
glGetQueryObjectuiv(obj, GL_QUERY_RESULT, reinterpret_cast<void*>(16));
Note how it takes the offset of 16 bytes and pretends that this value is actually a void* whose numerical value is 16. That's what the reinterpret_cast does.
How do you do that in C#? I have no idea; it would depend on the binding you're using, and you never specified what that was. Tao's long-since dead, and OpenTK looks to be heading that way too. But I did find out how to do this in OpenTK.
What you need to do is this:
gl4.glBindBuffer(GL_QUERY_BUFFER, bufferName.get(Buffer.QUERY));
for (int i = 0; i < viewports.length; ++i)
{
    gl4.glGetQueryObjectuiv(queryName.get(i), GL_QUERY_RESULT,
                            (IntPtr)(i * Integer.BYTES));
}
You multiply by Integer.BYTES because the value is a byte offset into the buffer, not an integer index into an array of ints.
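For completeness, the whole round trip in C++ terms would look something like this (a sketch; queries and queryBuffer stand in for the four query objects and the buffer bound to GL_QUERY_BUFFER, which must have been allocated with at least 4 * sizeof(GLuint) bytes):
// Write each query result into the bound query buffer at a byte offset.
glBindBuffer(GL_QUERY_BUFFER, queryBuffer);
for (int i = 0; i < 4; ++i)
{
    glGetQueryObjectuiv(queries[i], GL_QUERY_RESULT,
                        reinterpret_cast<GLuint*>(static_cast<uintptr_t>(i * sizeof(GLuint))));
}

// Later, copy the four values back to client memory (or map the buffer instead).
GLuint results[4] = {};
glGetBufferSubData(GL_QUERY_BUFFER, 0, sizeof(results), results);
glBindBuffer(GL_QUERY_BUFFER, 0);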

Related

Improving the performance of a WebGL2 texSubImage2D call with a large texture

Using WebGL2 I stream a 4K by 2K stereoscopic video as a texture onto the inside of a sphere in order to provide 360° VR video playback capability. I've optimized as much of the codebase as is feasible given the returns on time, and the application runs flawlessly when using an H.264 video source.
However, when using 8-bit VP8 or VP9 (which offer superior fidelity and file size; AV1 isn't available to me) I encounter FPS drops on weaker systems due to the extra CPU requirements for decoding VP8/VP9 video.
When profiling the app, I've identified that the per-frame call to texSubImage2D that updates the texture from the video consumes the large majority of each frame (texImage2D was even worse due to its allocations), but I'm unsure how to further optimize its use. Below are the things I'm already doing to minimize its impact:
I cache the texture's memory space at initial load using texStorage2D to keep it as contiguous as possible.
let glTexture = gl.createTexture();
let pixelData = new Uint8Array(4096*2048*3);
pixelData.fill(255);
gl.bindTexture(GL.TEXTURE_2D, glTexture);
gl.texStorage2D(GL.TEXTURE_2D, 1, GL.RGB8, 4096, 2048);
gl.texSubImage2D(GL.TEXTURE_2D, 0, 0, 0, 4096, 2048, GL.RGB, GL.UNSIGNED_BYTE, pixelData);
gl.generateMipmap(GL.TEXTURE_2D);
Then, during my render loop, both left and right eye poses are processed for each object before moving on to the next object. This allows me to call gl.bindTexture and gl.texSubImage2D only once per object per frame. Additionally, I skip populating shader program defines if the material for this entity is the same as the one for the previous entity, the video is paused, or it is still loading.
/* Main Render Loop Extract */
// Called each frame after pre-sorting entities
function DrawScene(glLayer, pose, scene){
    // Entities are pre-sorted for transparency blending, rendering opaque first and transparent second.
    for (let ii = 0; ii < _opaqueEntities.length; ii++){
        // Only render if the entity and its parent chain are active
        if(_opaqueEntities[ii] && _opaqueEntities[ii].isActiveHeirachy){
            for (let i = 0; i < pose.views.length; i++) {
                _RenderEntityView(pose, i, _opaqueEntities[ii]);
            }
        }
    }
    for (let ii = 0; ii < _transparentEntities.length; ii++) {
        // Only render if the entity and its parent chain are active
        if(_transparentEntities[ii] && _transparentEntities[ii].isActiveHeirachy){
            for (let i = 0; i < pose.views.length; i++) {
                _RenderEntityView(pose, i, _transparentEntities[ii]);
            }
        }
    }
}
let _programData;
function _RenderEntityView(pose, viewIdx, entity){
    // Calculates/manipulates the view matrix for this entity and view. (<0.1ms)
    // ...
    // Store reference to make stack overflow lines shorter :-)
    _programData = entity.material.shaderProgram;
    _BindEntityBuffers(entity, _programData); // The buffers Thomas, mind the BUFFERS!!!
    gl.uniformMatrix4fv(
        _programData.uniformData.uProjectionMatrix,
        false,
        _view.projectionMatrix
    );
    gl.uniformMatrix4fv(
        _programData.uniformData.uModelViewMatrix,
        false,
        _modelViewMatrix
    );
    // Render all triangles that make up the object.
    gl.drawElements(GL.TRIANGLES, entity.tris.length, GL.UNSIGNED_SHORT, 0);
}
let _attrName;
let _attrLoc;
let textureData;
function _BindEntityBuffers(entity, programData){
    gl.useProgram(programData.program);
    // Binds pre-defined shader attributes on an as-needed basis
    for(_attrName in programData.attributeData){
        _attrLoc = programData.attributeData[_attrName];
        // Bind only if it exists in the shader
        if(_attrLoc.key >= 0){
            _BindShaderAttributes(_attrLoc.key, entity.attrBufferData[_attrName].buffer,
                entity.attrBufferData[_attrName].compCount);
        }
    }
    // Bind triangle index buffer
    gl.bindBuffer(GL.ELEMENT_ARRAY_BUFFER, entity.triBuffer);
    // If already in use, this is an instanced material, so skip configuration.
    if(_materialInUse == entity.material){return;}
    _materialInUse = entity.material;
    // Use the material by applying its specific uniforms
    // Apply base color
    gl.uniform4fv(programData.uniformData.uColor, entity.material.color);
    // If the shader uses a diffuse texture
    if(programData.uniformData.uDiffuseSampler){
        // Store reference to make stack overflow lines shorter :-)
        textureData = entity.material.diffuseTexture;
        gl.activeTexture(gl.TEXTURE0);
        // Use assigned texture
        gl.bindTexture(gl.TEXTURE_2D, textureData);
        // If this is a video, update the texture buffer using the current video's playback frame data
        if(textureData.type == TEXTURE_TYPE.VIDEO &&
           textureData.isLoaded &&
           !textureData.paused){
            // This accounts for 42% of all script execution time!!!
            gl.texSubImage2D(gl.TEXTURE_2D, textureData.level, 0, 0,
                textureData.width, textureData.height, textureData.internalFormat,
                textureData.srcType, textureData.video);
        }
        gl.uniform1i(programData.uniformData.uDiffuseSampler, 0);
    }
}
function _BindShaderAttributes(attrKey, buffer, compCount, type=GL.FLOAT, normalize=false, stride=0, offset=0){
    gl.bindBuffer(GL.ARRAY_BUFFER, buffer);
    gl.vertexAttribPointer(attrKey, compCount, type, normalize, stride, offset);
    gl.enableVertexAttribArray(attrKey);
}
I've contemplated using pre-defined counters for all for loops to avoid the var i=0; allocation, but the gain from that seems hardly worth the effort.
Side note: the source video is actually larger than 4K, but at anything above 4K the FPS grinds down to about 10-12.
Obligatory: The key functionality above is extracted from a larger WebGL rendering framework I wrote that itself runs pretty damn fast already. The reason I'm not 'just using' Three, AFrame, or other such common libraries is that they do not have an ATO from the DOD, whereas in-house developed code is ok.
Update 9/9/21: At some point when Chrome updated from 90 to 93, the WebGL performance of texSubImage2D dropped dramatically, resulting in 100+ ms per frame regardless of CPU/GPU capability. Changing to texImage2D now results in around 16 ms per frame. In addition, shifting from RGB to RGB565 offers up a few ms of performance while minimally sacrificing color.
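For reference, the per-frame upload after that change looks roughly like this (a sketch; videoTexture and video are stand-ins for the framework's own texture object and HTMLVideoElement):
gl.bindTexture(gl.TEXTURE_2D, videoTexture);
// Re-specify the whole level each frame; RGB565 trades a little color depth for upload speed.
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGB565, gl.RGB, gl.UNSIGNED_SHORT_5_6_5, video);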
I'd still love to hear from GL/WebGL experts as to what else I can do to improve performance.

How To Set Up Byte Alignment From a MTLBuffer to a 2D MTLTexture?

I have an array of float values that represents a 2D image (think from a CCD) that I ultimately want to render into a MTLView. This is on macOS, but I'd like to be able to apply the same to iOS at some point. I initially create an MTLBuffer with the data:
NSData *floatData = ...;
id<MTLBuffer> metalBuffer = [device newBufferWithBytes:floatData.bytes
                                                length:floatData.length
                                               options:MTLResourceCPUCacheModeDefaultCache | MTLResourceStorageModeManaged];
From here, I run the buffer through a few compute pipelines. Next, I want to create an RGB MTLTexture object to pass to a few CIFilter/MPS filters and then display. It seems to make sense to create a texture that uses the already created buffer as backing to avoid making another copy. (I've successfully used textures with a pixel format of MTLPixelFormatR32Float.)
// create texture with scaled buffer - this is a wrapper, i.e. it shares memory with the buffer
MTLTextureDescriptor *desc;
desc = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatR32Float
                                                           width:width
                                                          height:height
                                                       mipmapped:NO];
desc.usage = MTLResourceUsageRead;
desc.storageMode = scaledBuffer.storageMode; // must match buffer
id<MTLTexture> scaledTexture = [scaledBuffer newTextureWithDescriptor:desc
                                                               offset:0
                                                          bytesPerRow:imageWidth * sizeof(float)];
The image dimensions are 242x242. When I run this I get:
validateNewTexture:89: failed assertion `BytesPerRow of a buffer-backed
texture with pixelFormat(MTLPixelFormatR32Float) must be aligned to 256 bytes,
found bytesPerRow(968)'
I know I need to use:
NSUInteger alignmentBytes = [self.device minimumLinearTextureAlignmentForPixelFormat:MTLPixelFormatR32Float];
How do I define the buffer such that the bytes are properly aligned?
More generally, is this the appropriate approach for this kind of data? This is the stage where I effectively convert the float data into something that has color. To clarify, this is my next step:
// render into RGB texture
MPSImageConversion *imageConversion = [[MPSImageConversion alloc] initWithDevice:self.device
                                                                         srcAlpha:MPSAlphaTypeAlphaIsOne
                                                                        destAlpha:MPSAlphaTypeAlphaIsOne
                                                                  backgroundColor:nil
                                                                   conversionInfo:NULL];
[imageConversion encodeToCommandBuffer:commandBuffer
                           sourceImage:scaledTexture
                      destinationImage:intermediateRGBTexture];
where intermediateRGBTexture is a 2D texture defined with MTLPixelFormatRGBA16Float to take advantage of EDR.
If it's important to you that the texture share the same backing memory as the buffer, and you want the texture to reflect the actual image dimensions, you need to ensure that the data in the buffer is correctly aligned from the start.
Rather than copying the source data all at once, you need to ensure the buffer has room for all of the aligned data, then copy it one row at a time.
NSUInteger rowAlignment = [self.device minimumLinearTextureAlignmentForPixelFormat:MTLPixelFormatR32Float];
NSUInteger sourceBytesPerRow = imageWidth * sizeof(float);
NSUInteger bytesPerRow = AlignUp(sourceBytesPerRow, rowAlignment);
id<MTLBuffer> metalBuffer = [self.device newBufferWithLength:bytesPerRow * imageHeight
                                                     options:MTLResourceCPUCacheModeDefaultCache];
const uint8_t *sourceData = floatData.bytes;
uint8_t *bufferData = metalBuffer.contents;
for (int i = 0; i < imageHeight; ++i) {
    memcpy(bufferData + (i * bytesPerRow), sourceData + (i * sourceBytesPerRow), sourceBytesPerRow);
}
Where AlignUp is your alignment function or macro of choice. Something like this:
static inline NSUInteger AlignUp(NSUInteger n, NSInteger alignment) {
    return ((n + alignment - 1) / alignment) * alignment;
}
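With the buffer laid out this way, the texture can then wrap it using the padded row stride (a sketch reusing desc and metalBuffer from above); any kernels that read the buffer directly also need to use the padded bytesPerRow as their row stride:
// bytesPerRow is now a multiple of minimumLinearTextureAlignmentForPixelFormat:.
id<MTLTexture> scaledTexture = [metalBuffer newTextureWithDescriptor:desc
                                                               offset:0
                                                          bytesPerRow:bytesPerRow];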
It's up to you to determine whether the added complexity is worth saving a copy, but this is one way to achieve what you want.

Reading Pixels in WebGL 2 as Float values

I need to read the pixels of my framebuffer as float values.
My goal is to get a fast transfer of lots of particles between CPU and GPU and process them in realtime. For that I store the particle properties in a floating point texture.
Whenever a new particle is added, I want to get the current particle array back from the texture, add the new particle properties and then fit it back into the texture (this is the only way I could think of to dynamically add particles and process them GPU-wise).
I am using WebGL 2 since it supports reading back pixels to a PIXEL_PACK_BUFFER target. I test my code in Firefox Nightly. The code in question looks like this:
// Initialize the WebGLBuffer
this.m_particlePosBuffer = gl.createBuffer();
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, this.m_particlePosBuffer);
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
...
// In the renderloop, bind the buffer and read back the pixels
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, this.m_particlePosBuffer);
gl.readBuffer(gl.COLOR_ATTACHMENT0); // Framebuffer texture is bound to this attachment
gl.readPixels(0, 0, _texSize, _texSize, gl.RGBA, gl.FLOAT, 0);
I get this error in my console:
TypeError: Argument 7 of WebGLRenderingContext.readPixels could not be converted to any of: ArrayBufferView, SharedArrayBufferView.
But looking at the current WebGL 2 Specification, this function call should be possible. Using the type gl.UNSIGNED_BYTE also returns this error.
When I try to read the pixels in an ArrayBufferView (which I want to avoid since it seems to be way slower) it works with the format/type combination of gl.RGBA and gl.UNSIGNED_BYTE for a Uint8Array() but not with gl.RGBA and gl.FLOAT for a Float32Array() - this is as expected since it's documented in the WebGL Specification.
I am thankful for any suggestions on how to get my float pixel values from my framebuffer or on how to otherwise get this particle pipeline going.
Did you try using this extension?
var ext = gl.getExtension('EXT_color_buffer_float');
The gl you have is a WebGL1 context, not WebGL2. Try:
var gl = document.getElementById("canvas").getContext('webgl2');
In WebGL2 the syntax for readPixels is
void gl.readPixels(x, y, width, height, format, type, ArrayBufferView pixels, GLuint dstOffset);
so
let pixels = new Uint8Array(gl.drawingBufferWidth * gl.drawingBufferHeight * 4);
gl.readPixels(0, 0, gl.drawingBufferWidth, gl.drawingBufferHeight, gl.RGBA, gl.UNSIGNED_BYTE, pixels, 0);
https://developer.mozilla.org/en-US/docs/Web/API/WebGLRenderingContext/readPixels
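Putting those pieces together, a buffer-based float readback in WebGL2 can look like this (a sketch; it assumes the framebuffer's color attachment is a float texture, which is what the extension above enables, and reuses the names from the question):
gl.getExtension('EXT_color_buffer_float'); // required to render to float color attachments

// Allocate the pack buffer once, large enough for RGBA float pixels.
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, this.m_particlePosBuffer);
gl.bufferData(gl.PIXEL_PACK_BUFFER, _texSize * _texSize * 4 * 4, gl.STREAM_READ);

// Read into the bound buffer; the last argument is a byte offset, not an ArrayBufferView.
gl.readBuffer(gl.COLOR_ATTACHMENT0);
gl.readPixels(0, 0, _texSize, _texSize, gl.RGBA, gl.FLOAT, 0);

// When the data is needed on the CPU, copy it out of the buffer.
const pixels = new Float32Array(_texSize * _texSize * 4);
gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, pixels);
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
In practice you would want to delay the getBufferSubData call (for example behind a fence sync) so the readback does not stall the pipeline.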

Is OpenCL known to generate corrupt code?

I have a small OpenCL kernel that writes to a shared GL texture. I have separated different stages of the computation into several functions. Every function gets a pointer to the final color and passes it along as needed. If you look at the code fragment, you'll see a line labeled "UNREACHABLE". For some reason it does get executed. Whatever color I put in there appears in the final image. How is that possible?
If I duplicate the same code block right below it, that does not happen for the duplicate. Only for the first one. :(
To make things even funnier, if I change the code above (e.g. add another multiplication) the UNREACHABLE line gets executed at random.
Therefore my questions: Is this a compiler bug? Do I exhaust a certain memory or register budget that I should be aware of? Are OpenCL compilers buggy in general?
void sample(float4 *color) {
    ...
    float4 r_color = get_color(...);
    float factor = r_color.w + (*color).w - 1.0f;
    r_color = r_color * ((r_color.w - factor) / r_color.w);
    *color += r_color;
    if(color->w >= 1.0f) {
        if(color->w <= 0.0f) {
            (*color) = (float4)(0.0f, 0.0f, 0.0f, 1.0f); //UNREACHABLE?
            return;
        }
    }
    ...
}
...
__kernel void render(
    __write_only image2d_t output_buffer,
    int width,
    int height
) {
    uint screen_x = get_global_id(0);
    uint screen_y = get_global_id(1);
    float4 color = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    sample(&color);
    write_imagef(output_buffer, (int2)(screen_x, screen_y), color);
}
My Platform:
Apple
Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz
selected device has extensions: cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_APPLE_fp64_basic_ops cl_APPLE_fixed_alpha_channel_orders cl_APPLE_biased_fixed_point_image_formats cl_APPLE_command_queue_priority
[Edit]
After observing the values I get during the calculation, I am thinking that r_color.w being exactly 0.0f after get_color may cause the problem. I am still looking for a definitive statement that says comparing against a NaN is undefined, or always true, or something.
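If that's the case, a guard along these lines (just a sketch of what I mean, not tested) would sidestep the division by zero that produces the NaN in the first place:
float4 r_color = get_color(...);
if (r_color.w > 0.0f) { // skip fully transparent samples; avoids (r_color.w - factor) / 0.0f producing NaN/Inf
    float factor = r_color.w + (*color).w - 1.0f;
    r_color = r_color * ((r_color.w - factor) / r_color.w);
    *color += r_color;
}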
Also "Is opencl known to generate corrupt code?" has the invisible postfix "or what am I missing".
I used to work with embedded systems where the vendor would provide their own proprietary compilers, which in turn were known to break your code. So I want to get this off the table if possible. I suspect that clang would not do that. But you never know.

NULL checks in GLSL

I'm trying to check inside the shader (GLSL) whether my vec4 is NULL. I need this for several reasons, mostly to stay compatible with specific graphics cards, since some of them pass a previous color in gl_FragColor and some don't (providing a null vec4 that needs to be overwritten).
Well, on a fairly new Mac, someone got this error:
java.lang.RuntimeException: Error creating shader: ERROR: 0:107: '==' does not operate on 'vec4' and 'int'
ERROR: 0:208: '!=' does not operate on 'mat3' and 'int'
This is my code in the fragment shader:
void main()
{
    if(gl_FragColor == 0) gl_FragColor = vec4(0.0, 0.0, 0.0, 0.0); //Line 107
    vec4 newColor = vec4(0.0, 0.0, 0.0, 0.0);
    [...]
    if(inverseViewMatrix != 0) //Line 208
    {
        [do stuff with it; though I can replace this NULL check with a boolean]
    }
    [...]
    gl_FragColor.rgb = mix(gl_FragColor.rgb, newColor.rgb, newColor.a);
    gl_FragColor.a += newColor.a;
}
As you can see, I do a 0/NULL check for gl_FragColor at the start, because some graphics cards pass valuable information there, but some don't. Now, on that particular Mac, it didn't work. I did some research but couldn't find any information on how to do a proper NULL check in GLSL. Is there even one, or do I really need to make separate shaders here?
All variables meant for reading, i.e. input variables, always deliver sensible values. Being an output variable, gl_FragColor is not one of them!
In this code
void main()
{
    if(gl_FragColor == 0) gl_FragColor = vec4(0.0, 0.0, 0.0, 0.0); //Line 107
    vec4 newColor = vec4(0.0, 0.0, 0.0, 0.0);
The very first thing you do is read from gl_FragColor. The GLSL specification clearly states that the value of an output variable such as gl_FragColor is undefined when the fragment shader stage is entered (point 1):
The value of an output variable will be undefined in any of the three following cases:
- At the beginning of execution.
- At each synchronization point, unless
  - the value was well-defined after the previous synchronization point and was not written by any invocation since, or
  - the value was written by exactly one shader invocation since the previous synchronization point, or
  - the value was written by multiple shader invocations since the previous synchronization point, and the last write performed by all such invocations wrote the same value.
- When read by a shader invocation, if
  - the value was undefined at the previous synchronization point and has not been written by the same shader invocation since, or
  - the output variable is written to by any other shader invocation between the previous and next synchronization points, even if that assignment occurs in code following the read.
Only after an element of an output variable has been written to for the first time is its value defined. So the whole thing you do there makes no sense. That it "didn't work" is completely permissible, and the error is on your end.
You're invoking undefined behaviour, and technically it would be permissible for your computer to become sentient, chase you down the street and erase all of your data as an alternative reaction to this.
In GLSL a vec4 is a regular datatype just like int. It's not some sort of pointer to an array which could be a null pointer. At best it has some default value that's not being overwritten by a call to glUniform.
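If reading the previous framebuffer color is not actually required, the simple fix is to never read the output variable at all: accumulate into a local and write gl_FragColor exactly once at the end. A minimal sketch based on the shader above (the [...] parts stand for the omitted code):
void main()
{
    vec4 color = vec4(0.0, 0.0, 0.0, 0.0); // start from a known value instead of reading gl_FragColor
    vec4 newColor = vec4(0.0, 0.0, 0.0, 0.0);
    [...]
    color.rgb = mix(color.rgb, newColor.rgb, newColor.a);
    color.a += newColor.a;
    gl_FragColor = color; // the output is only ever written, never read
}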
Variables in GLSL shaders are always defined (otherwise you'll get a linker error). If you don't supply those variables with data (by not loading the appropriate uniform, or not binding attributes to in or attribute variables), the values in those variables will be undefined (i.e., garbage), but present.
Even if you can't have null values, you can test undefined variables. This is a trick that I use to debug my shaders:
...
/* First we test for lower range */
if(suspect_variable.x < 0.5) {
    outColour = vec4(0,1,0,0); /* Green if in lower range */
} else if(suspect_variable.x >= 0.5) { /* Then for the higher range */
    outColour = vec4(1,0,0,0); /* Red if in higher range */
} else {
    /* Now we have tested for all real values.
       If we end up here we know that the value must be undefined */
    outColour = vec4(0,0,1,0); /* Blue if it's undefined */
}
You might ask, what could make a variable undefined? Out-of-range access of an array, for example, will cause it to be undefined:
const int numberOfLights = 2;
uniform vec3 lightColour[numberOfLights];
...
for(int i = 0; i < 100; i++) {
    /* When i is bigger than 1, suspect_variable will be undefined */
    suspect_variable = suspect_variable * lightColour[i];
}
It is a simple and easy trick to use when you do not have access to real debugging tools.
