Improving the performance of a WebGL2 texSubImage2D call with a large texture

Using WebGL2 I stream a 4K-by-2K stereoscopic video as a texture onto the inside of a sphere to provide 360° VR video playback. I've optimized as much of the codebase as is feasible given the returns on time, and the application runs flawlessly with an H.264 video source.
However, when using 8-bit VP8 or VP9 (which offer superior fidelity and file size; AV1 isn't available to me) I encounter FPS drops on weaker systems due to the extra CPU cost of decoding VP8/VP9.
When profiling the app, I've identified that the per-frame texSubImage2D call that updates the texture from the video consumes the large majority of each frame (texImage2D was even worse due to its allocations), but I'm unsure how to optimize its use further. Below are the things I'm already doing to minimize its impact:
I cache the texture's memory space at initial load using texStorage2D to keep it as contiguous as possible.
let glTexture = gl.createTexture();
// Dummy data to fully initialize the 4096x2048 RGB storage up front
let pixelData = new Uint8Array(4096 * 2048 * 3);
pixelData.fill(255);
gl.bindTexture(GL.TEXTURE_2D, glTexture);
// Immutable storage: one level, RGB8, full video dimensions
gl.texStorage2D(GL.TEXTURE_2D, 1, GL.RGB8, 4096, 2048);
// Last two parameters are format and type (GL.RGB / GL.UNSIGNED_BYTE)
gl.texSubImage2D(GL.TEXTURE_2D, 0, 0, 0, 4096, 2048, GL.RGB, GL.UNSIGNED_BYTE, pixelData);
gl.generateMipmap(GL.TEXTURE_2D);
Then, during my render loop, both the left and right eye poses are processed for each object before moving on to the next object. This lets me call gl.bindTexture and gl.texSubImage2D only once per object per frame. Additionally, I skip populating the shader program's uniforms if the material for this entity is the same as the previous entity's, and I skip the texture upload if the video is paused or still loading.
/* Main render loop extract */
// Called each frame after pre-sorting entities
function DrawScene(glLayer, pose, scene){
    // Entities are pre-sorted for transparency blending: opaque rendered first, transparent second.
    for (let ii = 0; ii < _opaqueEntities.length; ii++){
        // Only render if the entity and its parent chain are active
        if(_opaqueEntities[ii] && _opaqueEntities[ii].isActiveHeirachy){
            for (let i = 0; i < pose.views.length; i++) {
                _RenderEntityView(pose, i, _opaqueEntities[ii]);
            }
        }
    }
    for (let ii = 0; ii < _transparentEntities.length; ii++) {
        // Only render if the entity and its parent chain are active
        if(_transparentEntities[ii] && _transparentEntities[ii].isActiveHeirachy){
            for (let i = 0; i < pose.views.length; i++) {
                _RenderEntityView(pose, i, _transparentEntities[ii]);
            }
        }
    }
}
let _programData;
function _RenderEntityView(pose, viewIdx, entity){
    // Calculates/manipulates the view matrix for this entity and view. (<0.1ms)
    //...
    // Store reference to make Stack Overflow lines shorter :-)
    _programData = entity.material.shaderProgram;
    _BindEntityBuffers(entity, _programData); // The buffers Thomas, mind the BUFFERS!!!
    gl.uniformMatrix4fv(
        _programData.uniformData.uProjectionMatrix,
        false,
        _view.projectionMatrix
    );
    gl.uniformMatrix4fv(
        _programData.uniformData.uModelViewMatrix,
        false,
        _modelViewMatrix
    );
    // Render all triangles that make up the object.
    gl.drawElements(GL.TRIANGLES, entity.tris.length, GL.UNSIGNED_SHORT, 0);
}
let _attrName;
let _attrLoc;
let textureData;
function _BindEntityBuffers(entity, programData){
    gl.useProgram(programData.program);
    // Binds pre-defined shader attributes on an as-needed basis
    for(_attrName in programData.attributeData){
        _attrLoc = programData.attributeData[_attrName];
        // Bind only if the attribute exists in the shader
        if(_attrLoc.key >= 0){
            _BindShaderAttributes(_attrLoc.key, entity.attrBufferData[_attrName].buffer,
                entity.attrBufferData[_attrName].compCount);
        }
    }
    // Bind triangle index buffer
    gl.bindBuffer(GL.ELEMENT_ARRAY_BUFFER, entity.triBuffer);
    // If already in use, this is an instanced material, so skip configuration.
    if(_materialInUse == entity.material){return;}
    _materialInUse = entity.material;
    // Use the material by applying its specific uniforms
    // Apply base color
    gl.uniform4fv(programData.uniformData.uColor, entity.material.color);
    // If the shader uses a diffuse texture
    if(programData.uniformData.uDiffuseSampler){
        // Store reference to make Stack Overflow lines shorter :-)
        textureData = entity.material.diffuseTexture;
        gl.activeTexture(gl.TEXTURE0);
        // Use the assigned texture
        gl.bindTexture(gl.TEXTURE_2D, textureData);
        // If this is a video, update the texture from the current video playback frame
        if(textureData.type == TEXTURE_TYPE.VIDEO &&
            textureData.isLoaded &&
            !textureData.paused){
            // This accounts for 42% of all script execution time!!!
            gl.texSubImage2D(gl.TEXTURE_2D, textureData.level, 0, 0,
                textureData.width, textureData.height, textureData.internalFormat,
                textureData.srcType, textureData.video);
        }
        gl.uniform1i(programData.uniformData.uDiffuseSampler, 0);
    }
}
function _BindShaderAttributes(attrKey, buffer, compCount, type=GL.FLOAT, normalize=false, stride=0, offset=0){
    gl.bindBuffer(GL.ARRAY_BUFFER, buffer);
    gl.vertexAttribPointer(attrKey, compCount, type, normalize, stride, offset);
    gl.enableVertexAttribArray(attrKey);
}
I've contemplated using pre-declared counters for all for loops to avoid the let i = 0 allocation, but the gain from that hardly seems worth the effort.
Side note: the source video is actually larger than 4K, but at anything above 4K the FPS grinds down to about 10-12.
Obligatory: the key functionality above is extracted from a larger WebGL rendering framework I wrote, which itself already runs pretty damn fast. The reason I'm not 'just using' Three, AFrame, or other such common libraries is that they do not have an ATO from the DOD, whereas in-house developed code is OK.
Update 9/9/21: At some point when Chrome updated from 90 to 93, the WebGL performance of texSubImage2D dropped dramatically, resulting in 100+ ms per frame regardless of CPU/GPU capability. Changing to texImage2D now results in around 16 ms per frame. In addition, shifting from RGB to RGB565 buys a few more ms while sacrificing minimal color fidelity.
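For reference, this is roughly what that texImage2D + RGB565 upload looks like per frame (a sketch, not the framework's exact code; videoElement stands for whatever <video> element drives the texture, and the texture must not have been allocated with texStorage2D, since immutable storage can't be re-specified by texImage2D):
gl.bindTexture(gl.TEXTURE_2D, glTexture);
// Re-specify the whole level from the current video frame; RGB565 trades color depth for upload cost.
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGB565, gl.RGB, gl.UNSIGNED_SHORT_5_6_5, videoElement);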
I'd still love to hear from GL/WebGL experts as to what else I can do to improve performance.

Related

FFMPEG's sws_scale for converting RGB to YUV420 image is extremely slow

I am creating a basic application for recording screen activity using FFMPEG library calls.
My program flow is as below:
Fetch input data from the framebuffer (in RGB format) --> convert to YUV420 format and scale to the desired resolution --> encode the frame and send it to the muxer for MPEG2 conversion.
The input data to my program is raw framebuffer data in RGB format. I am using FFMPEG's sws_scale API to convert the RGB image to YUV420 for encoding.
Below is the code for converting the pixel format-
static int get_frame_buffer_data(AVFrame *pict, int frame_index, int width,
                                 int height, enum AVPixelFormat pix_fmt, char *rawFrame)
{
    struct SwsContext *sws_ctx = NULL;
    int ret = 0;
    rfbLog("[%s:%d]before conv_frame alloc:::pix_fmt = %d width = %d height = %d\n",__func__,__LINE__,pix_fmt,width,height);
    //picture->data[0] = (uint8_t*)&frameBuffer[0];
    picture->data[0] = (uint8_t*)&rawFrame[0];
    sws_ctx = sws_getCachedContext(sws_ctx, picture->width, picture->height, picture->format,
                                   width, height, pix_fmt, SWS_BICUBIC, NULL, NULL, NULL);
    if (!sws_ctx)
    {
        rfbLog("[%s:%d]Cannot initialize the conversion context\n",__func__,__LINE__);
        av_frame_free(&picture);
        sws_freeContext(sws_ctx);
        return -1;
    }
    rfbLog("[%s:%d]before sws_scale::: picture->linesize[0]=%d picture->height=%d pict->linesize = %d\n",__func__,__LINE__,picture->linesize[0],picture->height,pict->linesize[0]);
    ret = sws_scale(sws_ctx, (const uint8_t * const *)picture->data, picture->linesize, 0, picture->height, pict->data, pict->linesize);
    rfbLog("[%s:%d]after sws_scale::: picture->linesize[0]=%d picture->height=%d pict->linesize = %d returned height = %d\n",__func__,__LINE__, picture->linesize[0],picture->height,pict->linesize[0], ret);
    if (ret < 0)
    {
        rfbLog("[%s:%d]could not convert to yuv420\n",__func__,__LINE__);
        sws_freeContext(sws_ctx);
        av_frame_free(&picture);
        return -1;
    }
    sws_freeContext(sws_ctx);
    return 0;
}
I noticed that adding this code makes the application very slow. Profiling showed that sws_scale takes a very long time to convert the data to YUV420: around 200 ms per frame, which drives CPU utilization very high and sometimes makes my application unresponsive.
Can this be optimized, or is there an alternative solution for the conversion, and how can I achieve that?
"Optimization" in software means achieving the same result with fewer instructions. You call a single function, sws_scale, so unless you want to modify the ffmpeg source code, the only "optimization" you can do is at compile time. What options were set when ffmpeg was compiled? Try recompiling with -O3.
Other options:
Switch to zimg. It's a little faster, but not much, because color conversion is a complicated process.
Use a faster computer. 200 ms is pretty slow; unless you are scaling very large images, I suspect this is running on an underpowered CPU.
Use more threads. Assuming the CPU has more than one core (or the machine has more than one CPU), you can run multiple instances of sws_scale. Each frame will still take 200 ms, but if you can convert several frames in parallel it will bring the average down.
I don't think the problem is sws_scale itself, and I don't believe you can meaningfully optimize it as long as the conversion happens on the CPU.
If you really want good performance you should look at hardware (GPU) accelerated post-processing and pixel shaders (depending on the OS you use, e.g. DirectX on Windows).

Reading Pixels in WebGL 2 as Float values

I need to read the pixels of my framebuffer as float values.
My goal is to get a fast transfer of lots of particles between CPU and GPU and process them in realtime. For that I store the particle properties in a floating point texture.
Whenever a new particle is added, I want to get the current particle array back from the texture, add the new particle properties and then fit it back into the texture (this is the only way I could think of to dynamically add particles and process them GPU-wise).
I am using WebGL 2 since it supports reading back pixels to a PIXEL_PACK_BUFFER target. I test my code in Firefox Nightly. The code in question looks like this:
// Initialize the WebGLBuffer
this.m_particlePosBuffer = gl.createBuffer();
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, this.m_particlePosBuffer);
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
...
// In the renderloop, bind the buffer and read back the pixels
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, this.m_particlePosBuffer);
gl.readBuffer(gl.COLOR_ATTACHMENT0); // Framebuffer texture is bound to this attachment
gl.readPixels(0, 0, _texSize, _texSize, gl.RGBA, gl.FLOAT, 0);
I get this error in my console:
TypeError: Argument 7 of WebGLRenderingContext.readPixels could not be converted to any of: ArrayBufferView, SharedArrayBufferView.
But looking at the current WebGL 2 Specification, this function call should be possible. Using the type gl.UNSIGNED_BYTE also returns this error.
When I try to read the pixels in an ArrayBufferView (which I want to avoid since it seems to be way slower) it works with the format/type combination of gl.RGBA and gl.UNSIGNED_BYTE for a Uint8Array() but not with gl.RGBA and gl.FLOAT for a Float32Array() - this is as expected since it's documented in the WebGL Specification.
I am thankful for any suggestions on how to get my float pixel values from my framebuffer or on how to otherwise get this particle pipeline going.
Did you try using this extension?
var ext = gl.getExtension('EXT_color_buffer_float');
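For context, this is roughly the setup that extension enables (a sketch; texSize, tex, and fb are illustrative names, not from the question): render the particle data into a float texture attached to a framebuffer, then read it back as floats.
const ext = gl.getExtension('EXT_color_buffer_float'); // makes float color attachments renderable
const texSize = 256; // illustrative texture size
const tex = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, tex);
gl.texStorage2D(gl.TEXTURE_2D, 1, gl.RGBA32F, texSize, texSize);
const fb = gl.createFramebuffer();
gl.bindFramebuffer(gl.FRAMEBUFFER, fb);
gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, tex, 0);
// ...render the particle properties into fb...
const out = new Float32Array(texSize * texSize * 4);
gl.readPixels(0, 0, texSize, texSize, gl.RGBA, gl.FLOAT, out);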
The gl you have is WebGL1, not WebGL2. Try:
var gl = document.getElementById("canvas").getContext('webgl2');
In WebGL2 the syntax for readPixels is
void gl.readPixels(x, y, width, height, format, type, ArrayBufferView pixels, GLuint dstOffset);
so
let data = new Uint8Array(gl.drawingBufferWidth * gl.drawingBufferHeight * 4);
gl.readPixels(0, 0, gl.drawingBufferWidth, gl.drawingBufferHeight, gl.RGBA, gl.UNSIGNED_BYTE, data, 0);
https://developer.mozilla.org/en-US/docs/Web/API/WebGLRenderingContext/readPixels
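Since the question specifically targets the PIXEL_PACK_BUFFER path, here is a rough sketch of that variant as well (buffer and size names are illustrative, not from the question); it reads into a pixel pack buffer and copies the result out afterwards with getBufferSubData:
const byteCount = gl.drawingBufferWidth * gl.drawingBufferHeight * 4;
const packBuffer = gl.createBuffer();
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, packBuffer);
gl.bufferData(gl.PIXEL_PACK_BUFFER, byteCount, gl.STREAM_READ);
// With a PIXEL_PACK_BUFFER bound, the last readPixels argument is a byte offset into that buffer
gl.readPixels(0, 0, gl.drawingBufferWidth, gl.drawingBufferHeight, gl.RGBA, gl.UNSIGNED_BYTE, 0);
// Later (ideally after a fence sync has signaled), copy the data back to the CPU
const data = new Uint8Array(byteCount);
gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, data);
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);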

In Processing, how can I save part of the window as an image?

I am using Processing under Fedora 20, and I want to display an image of the extending tracks of objects moving across part of the screen, with each object displayed at its current position at the end of its track. To avoid having to record all the coordinates of the tracks, I use save("image.png"); to save the tracks so far, then draw the objects. In the next frame I use img = loadImage("image.png"); to restore the tracks made so far, without the objects, which would otherwise still appear in their previous positions. I extend the tracks to their new positions, then use save("image.png"); to save the extended tracks, still without the objects, ready for the next loop round. Then I draw the objects in their new positions at the end of their extended tracks. In this way successive loops show the objects advancing, with their previous positions as tracks behind them.
This has worked well in tests where the image is the whole frame, but now I need to put that display in a corner of the whole frame and leave the rest unchanged. I expect that createImage(...) will be the answer, but I cannot find any details of how to do so.
A similar question asked here has this recommendation: "The PImage class contains a save() function that exports to file. The API should be your first stop for questions like this." Of course I've looked at that API, but I don't think it helps here, unless I have to create the image to save pixel by pixel, in which case I would expect it to slow things down a lot.
So my question is: in Processing can I save and restore just part of the frame as an image, without affecting the rest of the frame?
I have continued to research this. It seems strange to me that I can find oodles of sketch references, tutorials, and examples that save and load the entire frame, but no easy way of saving and restoring just part of the frame as an image. I could probably do it using PImage, but that appears to require an awful lot of "image." prefixes in front of everything to be drawn there.
I have got round it with a kludge: I created a mask image (see this Processing reference) the size of the whole frame. The mask's grey levels represent opacity, so that black (0) is transparent and white (255) is fully opaque and completely conceals the background image, thus:
void setup() {
    size(1280, 800);
    background(0);          // whole frame is transparent..
    fill(255);              // ..and..
    rect(680, 0, 600, 600); // ..smaller image area is now opaque
    save("[path to sketch]/mask01.jpg");
}
void draw(){}
Then in my main code I use:
PImage img, mimg;
img = loadImage("image4.png");  // the image I want to see ..
// .. including the rest of the frame, which would obscure previous work
mimg = loadImage("mask01.jpg"); // load the mask
// apply the mask, allowing previous work to show through
img.mask(mimg);
// display the masked image
image(img, 0, 0);
I will accept this as an answer if no better suggestion is made.
void setup(){
    size(640, 480);
    background(0);
    noStroke();
    fill(255);
    rect(40, 150, 200, 100);
}
void draw(){
}
void mousePressed(){
    PImage img = get(40, 150, 200, 100);
    img.save("test.jpg");
}
Old news, but here's an answer: you can use the pixel array and math.
Let's say you want to export everything in your viewport except an unwanted strip, for example the first 200 pixels of each row.
You can use loadPixels(); to fill the pixels[] array with the current content of the viewport, then fish the pixels you want out of this array.
In that case, here's a way to filter out the unwanted pixels:
void exportImage() {
    // creating the image at the desired size
    PImage img = createImage(600, 900, RGB);
    loadPixels();
    int index = 0;
    for (int i = 0; i < pixels.length; i++) {
        // filtering out the unwanted first 200 pixels of every row.
        // Remember that the pixels[] array is one-dimensional, so some math is unavoidable;
        // for this simple example I use the modulo operator.
        if (i % width >= 200) { // "magic numbers" are bad, remember. This is only a simplification.
            img.pixels[index] = pixels[i];
            index++;
        }
    }
    img.updatePixels();
    img.save("test.png");
}
It may be too late to help you, but maybe someone else will need this. Either way, have fun!

Does WebGL drawArrays() empty/discard the buffers?

Trying to speed up the display of many near-identical objects in WebGL, I tried (naively, I guess) to re-use the buffer contents. In the drawing routine of each object, I have (somewhat simplified):
if (! dataBuffered) {
    dataBuffered = true;
    :
    : gl stuff here: texture loading, buffer binding and filling
    :
}
// set projection and model-view matrices
gl.uniformMatrix4fv (shaderProgram.uPMatrix, false, pMatrix);
gl.uniformMatrix4fv (shaderProgram.uMVMatrix, false, mvMatrix);
// draw rectangle filled with texture
gl.drawArrays(gl.TRIANGLE_STRIP, 0, starVertexPositionBuffer.numItems);
My idea was that the texture, vertex buffer, and texture-coordinate buffer are the same, but the model-view matrix changes (same object in different places). But, alas, nothing shows up. When I comment out the dataBuffered = true line, everything is visible.
So my question is: does drawArrays() discard or empty the buffers? What else is happening? (I'm working through the lessons at learningwebgl.com, if that matters.)
Short answer: yes, you can reuse all the state you've set up across more than one gl.drawArrays() call.
http://omino.com/experiments/webgl/simplestWebGlReuseBuffers.html is a little example where it just changes one uniform float (Y-scale) and redraws, twice for every tick.
(In this case there's no textures, but some other state stays sticky.)
Hope that helps!
uniformSetFloat(gl,prog,"scaleY",1.0);
gl.drawArrays(gl.TRIANGLES, 0, posPoints.length / 3);
uniformSetFloat(gl,prog,"scaleY",0.2);
gl.drawArrays(gl.TRIANGLES, 0, posPoints.length / 3);
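Applied to the code in the question, that means something like the following should work (a sketch, assuming the buffers and texture were bound in the one-time block; mvMatrix2 here stands for the second object's model-view matrix, which isn't in the original snippet):
// First instance
gl.uniformMatrix4fv(shaderProgram.uPMatrix, false, pMatrix);
gl.uniformMatrix4fv(shaderProgram.uMVMatrix, false, mvMatrix);
gl.drawArrays(gl.TRIANGLE_STRIP, 0, starVertexPositionBuffer.numItems);
// Second instance: same buffers and texture, only the model-view matrix changes
gl.uniformMatrix4fv(shaderProgram.uMVMatrix, false, mvMatrix2);
gl.drawArrays(gl.TRIANGLE_STRIP, 0, starVertexPositionBuffer.numItems);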

Resizing and saving an image in WinMobile and .NET CF throws OutOfMemoryException

I have a WinMobile app which allows the user to snap a photo with the camera and then use it for various things. The photo can be snapped at 1600x1200, 800x600 or 640x480, but it must always be resized so that its longest side is 400px (the other side scaled proportionally, of course). Here's the code:
private void LoadImage(string path)
{
    Image tmpPhoto = new Bitmap(path);
    // calculate new bitmap size...
    double width = ...
    double height = ...
    // draw new bitmap
    Image photo = new Bitmap(width, height, System.Drawing.Imaging.PixelFormat.Format24bppRgb);
    using (Graphics g = Graphics.FromImage(photo))
    {
        g.FillRectangle(new SolidBrush(Color.White), new Rectangle(0, 0, photo.Width, photo.Height));
        int srcX = (int)((double)(tmpPhoto.Width - width) / 2d);
        int srcY = (int)((double)(tmpPhoto.Height - height) / 2d);
        g.DrawImage(tmpPhoto, new Rectangle(0, 0, photo.Width, photo.Height), new Rectangle(srcX, srcY, photo.Width, photo.Height), GraphicsUnit.Pixel);
    }
    tmpPhoto.Dispose();
    // save new image and dispose
    photo.Save(Path.Combine(config.TempPath, config.TempPhotoFileName), System.Drawing.Imaging.ImageFormat.Jpeg);
    photo.Dispose();
}
Now the problem is that the app breaks in the photo.Save call with an OutOfMemoryException, and I don't know why, since I dispose of tmpPhoto (the original photo from the camera) as soon as I can, and I also dispose of the Graphics object. Why does this happen? It seems impossible to me that one can't take a photo with the camera and resize/save it without making it crash :( Should I resort to C++ for such a simple thing?
Thanks.
Have you looked at memory usage at each step to see exactly where you're using the most? You omitted your calculations for width and height, but assuming they are right you would end up with photo requiring 400x300x3 bytes (24 bits per pixel) == 360k for the bitmap data itself, which is not inordinately large.
My guess is that even though you're calling Dispose, the resources aren't getting released, especially if you're calling this method multiple times. The CF behaves in an unexpected way with Bitmaps. I call it a bug. The CF team doesn't.
