OpenGL rendering performance - default shader or no shader?

I have some code for rendering a video, so the OpenGL side of it (once the rendered frame is available in the target texture) is very simple: Just render it to the target rectangle.
What complicates things a bit is that I am using a third-party SDK to render the UI, so I cannot know what state changes it makes, and therefore every time I am rendering a frame I have to make sure all the states I need are set correctly.
I am using a vertex and a texture coordinate buffer to draw my rectangle like this:
glActiveTexture(GL_TEXTURE0);
glEnable(GL_TEXTURE_RECTANGLE_ARB);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texHandle);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);
glPushClientAttrib( GL_CLIENT_VERTEX_ARRAY_BIT );
glEnableClientState( GL_VERTEX_ARRAY );
glEnableClientState( GL_TEXTURE_COORD_ARRAY );
glBindBuffer(GL_ARRAY_BUFFER, m_vertexBuffer);
glVertexPointer(4, GL_FLOAT, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, m_texCoordBuffer);
glTexCoordPointer(2, GL_FLOAT, 0, 0);
glDrawArrays(GL_QUADS, 0, 4);
glPopClientAttrib();
(Is there anything I can skip, even without knowing what happens inside the UI library?)
Now I wonder (this is mostly theoretical, since I suppose there won't be much difference when drawing just one quad): is it theoretically faster to render as above, or to instead write a simple default vertex and fragment shader that does nothing more than return ftransform() for the position and uses the default behaviour for the fragment color as well?
I wonder whether using a shader would let me skip certain state changes or generally speed things up, or whether with the code above OpenGL internally does just that anyway and the outcome is exactly the same.

If you are worried about clobbering the UI SDK state, you should wrap the code with glPushAttrib(GL_ENABLE_BIT | GL_TEXTURE_BIT) ... glPopAttrib() as well.
You could simplify the state management code a bit by using a vertex array object.
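For illustration, here's roughly what that wrapping could look like around the draw code from the question (a sketch only; texHandle, m_vertexBuffer and m_texCoordBuffer are the question's own names):
glPushAttrib(GL_ENABLE_BIT | GL_TEXTURE_BIT);     // save enable flags, texture bindings and texture environment
glPushClientAttrib(GL_CLIENT_VERTEX_ARRAY_BIT);   // save client array state
glActiveTexture(GL_TEXTURE0);
glEnable(GL_TEXTURE_RECTANGLE_ARB);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texHandle);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glBindBuffer(GL_ARRAY_BUFFER, m_vertexBuffer);
glVertexPointer(4, GL_FLOAT, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, m_texCoordBuffer);
glTexCoordPointer(2, GL_FLOAT, 0, 0);
glDrawArrays(GL_QUADS, 0, 4);
glPopClientAttrib();
glPopAttrib();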
As to using a shader, for this simple program I wouldn't bother. It would be one more bit of state you'd have to save & restore, and you're right that internally OpenGL is probably doing just that for the same outcome.
On speeding things up: performance is going to be dominated by the cost of sending tens or hundreds of kilobytes of video frame data to the GPU, and adding or removing a few OpenGL calls is very unlikely to make a difference. I'd look first at possible differences in frame rate between the UI and the video stream: for example, if the UI redraws more often than the video updates, arrange for the video data to be copied to the GPU once and re-used instead of copying it on every UI redraw.
Hope this helps.

Related

OpenGL ES 3.x How to (performantly) render blended triangles front-to-back with alpha-blending and early-reject occluded fragments?

I recently found out that one can render alpha-blended primitives correctly not just back-to-front but also front-to-back (http://hacksoflife.blogspot.com/2010/02/alpha-blending-back-to-front-front-to.html) by using GL_ONE_MINUS_DST_ALPHA, GL_ONE, premultiplying the fragment's alpha in the fragment shader and clearing destination alpha to black before rendering.
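For reference, the state setup for that front-to-back pass looks roughly like this (a sketch; it assumes the fragment shader outputs premultiplied-alpha colors and the geometry is drawn front-to-back):
glClearColor(0.0f, 0.0f, 0.0f, 0.0f);         // clear destination alpha to 0
glClear(GL_COLOR_BUFFER_BIT);
glEnable(GL_BLEND);
glBlendFunc(GL_ONE_MINUS_DST_ALPHA, GL_ONE);  // result = src * (1 - dst.a) + dst
// ... draw the blended geometry sorted front-to-back ...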
It occurred to me that it would then be great if one could combine this with EITHER early-z rejection OR some kind of early "destination-alpha testing" in order to discard fragments that won't contribute to the final pixel color.
When rendering with front-to-back alpha-blending, a fragment can be skipped if the destination-alpha at this location already contains the value 1.0.
I did prototype-implement that by using GL_EXT_shader_framebuffer_fetch to read the destination alpha at the start of the fragment shader and manually discard the fragment if the value is above a certain threshold (a sketch of that shader follows below). That works, but it actually made things slower on my test hardware (Snapdragon XR2) - so I wonder:
whether it's somehow possible to not even have the fragment shader execute if destination alpha is already above a certain threshold?
alternatively, whether it would be possible to write to the depth buffer only for fragments that are completely opaque, and leave the current depth buffer value unchanged for all fragments with an alpha value of less than 1 (while still depth-testing every fragment). That should allow the hardware to use early-z rejection for occluded fragments. So,
Is this possible somehow (i.e. use depth testing, but update the depth buffer value only for opaque fragments and leave it unchanged for others)?
Bottom line: this would reduce the overdraw of alpha-blended sprites to only those fragments that actually contribute to the final pixel color, and I wonder whether there is a performant way of doing it.
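For reference, the prototype mentioned above looked roughly like this (a sketch only; the names and the 0.999 threshold are mine, ESSL 3.00 syntax for GL_EXT_shader_framebuffer_fetch):
#version 300 es
#extension GL_EXT_shader_framebuffer_fetch : require
precision mediump float;
uniform sampler2D uSprite;
in vec2 vUV;
layout(location = 0) inout vec4 oColor;   // on entry holds the current framebuffer value
void main() {
    // If destination alpha is already (nearly) saturated, this fragment cannot contribute.
    if (oColor.a >= 0.999) {
        discard;
    }
    vec4 c = texture(uSprite, vUV);
    oColor = vec4(c.rgb * c.a, c.a);      // premultiplied alpha; fixed-function blending does the rest
}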
For number 2, I think you could modify gl_FragDepth in the fragment shader to achieve something close, but doing so would disable early-z rejection so wouldn't really help.
I think one viable way to reduce overdraw would be to create a tool to generate a mesh for each sprite which aims to cover a decent proportion of the opaque part of the sprite without using too many verts. I imagine for a typical sprite, even just a well placed quad could cover 80%+.
You'd render the generated opaque geometry of your sprites with depth write enabled, and do a second pass the ordinary way with depth testing enabled to cover the transparent parts.
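A rough sketch of the state changes for those two passes (the function names are placeholders for your own draw lists):
// Pass 1: the generated opaque interior meshes, written to the depth buffer.
glDisable(GL_BLEND);
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_TRUE);                          // depth writes on, so fragments behind them get early-z rejected
drawOpaqueSpriteInteriors();                   // placeholder

// Pass 2: the full sprites, blended, depth-tested but not writing depth.
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);   // assuming premultiplied alpha
glDepthMask(GL_FALSE);
drawSpritesSorted();                           // placeholder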
You would massively reduce overdraw, but significantly increase the complexity of your code and the number of verts rendered. You would double your draw calls, but if you're atlasing and using texture arrays, you might be doubling from 1 to 2 draw calls, which is fine. I've never tried it, so I can't say whether it's worth all the effort involved.

How can I properly manage data in modern OpenGL while considering performance?

In modern OpenGL (3.x+), you create buffer objects which contain vertex attributes, such as positions, colors, normals, texture coordinates, and indices.
These buffers are then assigned to a corresponding vertex array object (VAO) which essentially contains pointers to all of the data as well as the data's format.
There are many tutorials out there for how to create a VAO and how to use it; unfortunately, it isn't clear how VAOs should be used in larger applications or games.
For example, a game might contain many 3D models, and it seems appropriate to give each model its own VAO.
On the other hand, a particle system contains many disconnected primitives traveling independently of one another. In this scenario, using a single VAO per system might improve performance in CPU-GPU transfers. However, in this case the primitives need to be translated differently from one another, so it might also seem viable to give each particle its own very tiny VAO.
Question:
For a large quantity of small data sets (such as a particle system of quads), should all of the data be packed into one VAO or divided into many VAOs? What are the performance benefits/drawbacks of each method?
Assuming one VAO is used, the only apparent way to translate each independent sub-unit of data is to modify the actual position information and re-upload it to the GPU. Doing this many times is costly in terms of time.
Assuming many VAOs are used, the GPU must store duplicate formatting information for each VAO. This seems costly in terms of space (but I'm not sure whether it is necessarily slow).
Side-Note:
Yes, I'm personally interested in managing a particle system. To keep this question more generic, and more useful for others, I am asking about VAO management as a whole. I am curious what management methods are more suitable vs others when considering the type of data being stored and when considering what type of performance is desired (time/space).
VAO creation is described well here:
https://www.opengl.org/wiki/Vertex_Specification
In the case of particles it would be best to use instanced rendering, where you render all the particles in a single draw call but assign a different position to each one as a per-instance attribute. You can update an existing buffer using glBufferSubData. That way you can update the positions on the CPU side between frames, and then update the buffer.
In more complex examples you can instance whichever attributes you want to.
The way I call instanced rendering and set it up in my code is as follows:
void CreateInstancedAttrib(unsigned int attribNum, GLuint VAO, GLuint& posVBO, int numInstances){
    glBindVertexArray(VAO);
    // CreateVertexArrayBuffer is a local helper that creates and fills a VBO
    // (here numInstances * sizeof(vec3) bytes, GL_DYNAMIC_DRAW since it is rewritten every frame).
    posVBO = CreateVertexArrayBuffer(0, sizeof(vec3), numInstances, GL_DYNAMIC_DRAW);
    glEnableVertexAttribArray(attribNum);
    glVertexAttribPointer(attribNum, 3, GL_FLOAT, GL_FALSE, sizeof(vec3), 0);
    glVertexAttribDivisor(attribNum, 1);   // advance this attribute once per instance, not per vertex
    glBindVertexArray(0);
}
Here posVBO is the per-instance position buffer; the lines after its creation set up the attribute pointer and the divisor that makes it advance once per instance.
When rendering:
void RenderInstancedStaticMesh(const StaticMesh& mesh, MaterialUniforms& uniforms, const vec3* positions){
    for (unsigned int meshNum = 0; meshNum < mesh.m_numMeshes; meshNum++){
        if (mesh.m_meshData[meshNum]->m_hasTexture){
            glBindTexture(GL_TEXTURE_2D, mesh.m_meshData[meshNum]->m_texture);
        }
        glBindVertexArray(mesh.m_meshData[meshNum]->m_vertexBuffer);
        // Upload this frame's per-instance positions into the instanced attribute buffer.
        glBindBuffer(GL_ARRAY_BUFFER, mesh.m_meshData[meshNum]->m_instancedDataBuffer);
        glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(vec3) * mesh.m_numInstances, positions);
        glUniform3fv(uniforms.diffuseUniform, 1, &mesh.m_meshData[meshNum]->m_material.diffuse[0]);
        glUniform3fv(uniforms.specularUniform, 1, &mesh.m_meshData[meshNum]->m_material.specular[0]);
        glUniform3fv(uniforms.ambientUniform, 1, &mesh.m_meshData[meshNum]->m_material.ambient[0]);
        glUniform1f(uniforms.shininessUniform, mesh.m_meshData[meshNum]->m_material.shininess);
        // One call draws every instance of this mesh.
        glDrawElementsInstanced(GL_TRIANGLES, mesh.m_meshData[meshNum]->m_numFaces * 3,
                                GL_UNSIGNED_INT, 0, mesh.m_numInstances);
    }
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    glBindVertexArray(0);
}
That's a lot to take in, but the important calls are glDrawElementsInstanced and glBufferSubData.
If you look up both functions, I'm sure you will come to understand how instanced rendering works.
If you have any more questions, please ask.
The general rule is that you want to minimize the number of draw calls. If you put things into individual VAOs you have to perform a draw call for each VAO, and switching between VAOs and VBOs comes with a cost as well. Don't think of VAOs and VBOs as "model" containers, but as memory pools, where each VBO/VAO should be used to coalesce data of identical properties.
A particle system is the perfect candidate for putting everything into a single VBO/VAO. In the usual case you use instanced rendering, where the VBO contains the information about where to place each particle.
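As a sketch of the "memory pool" idea (made-up names and sizes): two meshes coalesced into one VBO under one VAO, each drawn from its own range of the buffer:
// One buffer holding the vertices of both meshes back to back.
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, bytesA + bytesB, NULL, GL_STATIC_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0,      bytesA, meshAVertices);
glBufferSubData(GL_ARRAY_BUFFER, bytesA, bytesB, meshBVertices);
// Later, with the same VAO still bound, only the draw ranges differ:
glDrawArrays(GL_TRIANGLES, 0,            vertexCountA);
glDrawArrays(GL_TRIANGLES, vertexCountA, vertexCountB);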

glTexSubImage2D extremely slow on Intel video card

My video card is Mobile Intel 4 Series. I'm updating a texture with changing data every frame, here's my main loop:
for(;;) {
    Timer timer;
    glBindTexture(GL_TEXTURE_2D, tex);
    glBegin(GL_QUADS); ... /* draw textured quad */ ... glEnd();
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);
    swapBuffers();
    cout << timer.Elapsed();
}
Every iteration takes 120ms. However, inserting glFlush before glTexSubImage2D brings the iteration time to 2ms.
The issue is not in the pixel format. I've tried the pixel formats BGRA, RGBA and ABGR_EXT together with the pixel types UNSIGNED_BYTE, BYTE, UNSIGNED_INT_8_8_8_8 and UNSIGNED_INT_8_8_8_8_EXT. The texture's internal pixel format is RGBA.
The order of calls matters. Moving the texture upload before the quad drawing, for example, fixes the slowness.
I also tried this on a GeForce GT 420M card, and it works fast there. My real app does have performance problems on non-Intel cards that are fixed by glFlush calls, but I haven't distilled those into a test case yet.
Any ideas on how to debug this?
One issue is that glTexImage2D performs a full reinitialization of the texture object. If only the data changes, but the format remains the same, use glTexSubImage2D to speed things up (just a reminder).
The other issue is that, despite its name, immediate mode (glBegin(…) … glEnd()) is not synchronous: the calls return long before the GPU is done drawing. Adding a glFinish() will synchronize, but so will any call that modifies data still required by queued operations. So in your case glTexImage2D (and glTexSubImage2D) must wait for the drawing to finish.
Usually it's best to do all volatile resource uploads either at the beginning of the drawing function, or while SwapBuffers blocks, from a separate thread, through buffer objects. Buffer objects were introduced for that very reason: to allow asynchronous, yet tightly scheduled, operation.
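For what it's worth, a minimal sketch of that buffer-object path using a pixel unpack buffer (sizes assume the 512x512 BGRA texture from the question):
// One-time setup: a PBO big enough for one frame of texture data.
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 512 * 512 * 4, NULL, GL_STREAM_DRAW);

// Per frame: copy the new pixels into the PBO, then kick off the texture update.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, 512 * 512 * 4, data);
glBindTexture(GL_TEXTURE_2D, tex);
// With a PBO bound, the last argument is a byte offset into the buffer, not a client pointer,
// so the call can return before the transfer has actually completed.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, (const void*)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);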
I assume you're actually using that texture for one or more of your quads?
Uploading textures is one of the most expensive operations possible. Since your texture data changes every frame, the upload is unavoidable, but you should try to do it when the texture isn't in use by the GPU. Remember that glBegin(GL_QUADS); ... glEnd(); doesn't actually draw quads, it requests that the GPU render the quads. Until that rendering completes, the texture will be locked. Depending on the implementation, this might cause the texture upload to wait (as with glFlush), but it could also cause the upload to fail, in which case you've wasted megabytes of PCIe bandwidth and the driver has to retry.
It sounds like you already have a solution: upload all new textures at the beginning of the frame. So what's your question?
NOTE: Intel integrated graphics are horribly slow anyway.
When you make a draw call (glDrawElements, or others), the driver simply adds this call to a command buffer and lets the GPU consume those commands when it can.
If this buffer had to be consumed entirely at glSwapBuffers, this would mean that the GPU would be idle after that, waiting for you to send new commands.
Drivers solve this by letting the GPU lag one frame behind. This is the first reason why glTexSubImage2D blocks: the driver waits until the GPU is no longer using the texture (from the previous frame) before it begins the transfer, so that you never get half-updated data.
The other reason is that glTexSubImage2D is synchronous: it will also block for the duration of the transfer itself.
You can solve the first issue by keeping 2 textures : one for the current frame, one for the previous frame. Upload the texture in the former, but draw with the latter.
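A sketch of that ping-pong scheme (hypothetical names; the texture size is taken from the question):
GLuint tex[2];        // two textures: upload into one, draw with the other
int uploadIdx = 0;
// per frame:
glBindTexture(GL_TEXTURE_2D, tex[uploadIdx]);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);
glBindTexture(GL_TEXTURE_2D, tex[1 - uploadIdx]);   // draw with last frame's texture
// ... draw textured quad ...
uploadIdx = 1 - uploadIdx;
swapBuffers();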
You can solve the second issue by using a pixel buffer object (a buffer bound to GL_PIXEL_UNPACK_BUFFER), which allows asynchronous transfers.
In your case, I suspect that calling glTexSubImage2D just before SwapBuffers adds an extra synchronization in the driver, whereas drawing the quad just before SwapBuffers simply appends the command to the buffer. 120 ms is probably a driver bug, though: even an Intel GMA doesn't need 120 ms to upload a 512x512 texture.

OpenGL - Fast Textured Quads?

I am trying to display as many textured quads as possible at random positions in 3D space. In my experience so far, I cannot display even a couple of thousand of them without the fps dropping significantly below 30 (my camera movement script becomes laggy).
Right now I am following an ancient tutorial. After initializing OpenGL:
glEnable(GL_TEXTURE_2D);
glShadeModel(GL_SMOOTH);
glClearColor(0, 0, 0, 0);
glClearDepth(1.0f);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_NICEST);
I set the viewpoint and perspective:
glViewport(0,0,width,height);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(60.0f,(GLfloat)width/(GLfloat)height,0.1f,100.0f);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
Then I load some textures:
glGenTextures(TEXTURE_COUNT, &texture[0]);
for (int i...){
    glBindTexture(GL_TEXTURE_2D, texture[i]);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST);
    gluBuild2DMipmaps(GL_TEXTURE_2D, 3, TextureImage[0]->w, TextureImage[0]->h, GL_RGB, GL_UNSIGNED_BYTE, TextureImage[0]->pixels);
}
And finally I draw my GL_QUADS using:
glBindTexture(GL_TEXTURE_2D, q);
glTranslatef(fDistanceX,fDistanceZ,-fDistanceY);
glBegin(GL_QUADS);
glNormal3f(a,b,c);
glTexCoord2f(d, e); glVertex3f(x1, y1, z1);
glTexCoord2f(f, g); glVertex3f(x2, y2, z2);
glTexCoord2f(h, k); glVertex3f(x3, y3, z3);
glTexCoord2f(m, n); glVertex3f(x4, y4, z4);
glEnd();
glTranslatef(-fDistanceX,-fDistanceZ,fDistanceY);
I find all that code very self-explanatory. Unfortunately that way of doing things is deprecated, as far as I know. I read some vague things about PBOs and vertex arrays on the internet, but I did not find any tutorial on how to use them. I don't even know if these objects are suited to realize what I am trying to do here (a billion quads on the screen without lag). Perhaps anyone here could give me a definitive suggestion of what I should use to achieve the result? And if you happen to have one more minute of spare time, could you give me a short summary of how these functions are used (just as I did with the deprecated ones above)?
Perhaps anyone here could give me a definitive suggestion of what I should use to achieve the result?
What is "the result"? You have not explained very well what exactly it is that you're trying to accomplish. All you've said is that you're trying to draw a lot of textured quads. What are you trying to do with those textured quads?
For example, you seem to be creating the same texture, with the same width and height, given the same pixel data. But you store these in different texture objects. OpenGL does not know that they contain the same data. Therefore, you spend a lot of time swapping textures needlessly when you render quads.
If you're just randomly drawing them to test performance, then the question is meaningless. Such tests are pointless, because they are entirely artificial. They test only this artificial scenario where you're changing textures every time you render a quad.
Without knowing what you are trying to ultimately render, the only thing I can do is give general performance advice. In order (ie: do the first before you do the later ones):
Stop changing textures for every quad. You can package multiple images together in the same texture, then render all of the quads that use that texture at once, with only one glBindTexture call. The texture coordinates of the quad specifies which image within the texture that it uses.
Stop using glTranslate to position each individual quad. You can use it to position groups of quads, but you should do the math yourself to compute the quad's vertex positions. Once those glTranslate calls are gone, you can put multiple quads within the space of a single glBegin/glEnd pair.
Assuming that your quads are static (fixed position in model space), consider using a buffer object to store and render with your quad data.
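Putting those points together, a sketch of what that could look like with fixed-function vertex arrays (quadCount, vertices and atlasTexture are placeholders; positions are pre-transformed on the CPU, and the texture coordinates address sub-images of one atlas):
struct QuadVertex { float x, y, z; float u, v; };   // hypothetical interleaved layout

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, quadCount * 4 * sizeof(QuadVertex), vertices, GL_STATIC_DRAW);

glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_TEXTURE_COORD_ARRAY);
glVertexPointer(3, GL_FLOAT, sizeof(QuadVertex), (const void*)0);
glTexCoordPointer(2, GL_FLOAT, sizeof(QuadVertex), (const void*)(3 * sizeof(float)));

// One atlas texture, one draw call for every quad that uses it.
glBindTexture(GL_TEXTURE_2D, atlasTexture);
glDrawArrays(GL_QUADS, 0, quadCount * 4);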
I read some vague things about PBOs and vertex arrays on the internet, but I did not find any tutorial on how to use them.
Did you try the OpenGL Wiki, which has a pretty good list of tutorials (as well as general information on OpenGL)? In the interest of full disclosure, I did write one of them.
I heard, in modern games milliards of polygons are rendered in real time
Actually it's in the millions. I presume you're German: "Milliarde" translates to "billion" in English.
Right now I am following an ancient tutorial.
This is your main problem. Contemporary OpenGL applications don't use ancient rendering methods. You're using immediate mode, which means you go through several function calls just to submit a single vertex. This is highly inefficient. Modern applications, like games, can reach such high triangle counts because they don't waste their CPU time calling that many functions and don't waste CPU→GPU bandwidth re-streaming every vertex each frame.
To reach such high counts of triangles rendered in real time, you must place all the geometry data in "fast memory", i.e. in the RAM on the graphics card. The technique OpenGL offers for this is called Vertex Buffer Objects. Using a VBO you can draw large batches of geometry with a single draw call (glDrawArrays, glDrawElements and their relatives).
After getting the geometry out of the way, you must be nice to the GPU. GPUs don't like it if you switch textures or shaders often. Switching a texture invalidates the contents of the cache(s); switching a shader means stalling the GPU pipeline and, worse, invalidating the execution path prediction statistics (the GPU gathers statistics about which execution paths of a shader are most likely to be taken and which memory access patterns it exhibits, and uses this to iteratively optimize shader execution).

DirectX9 - Efficiently Drawing Sprites

I'm trying to create a platformer game, and I am taking various sprite blocks, and piecing them together in order to draw the level. This requires drawing a large number of sprites on the screen every single frame. A good computer has no problem handling drawing all the sprites, but it starts to impact performance on older computers. Since this is NOT a big game, I want it to be able to run on almost any computer. Right now, I am using the following DirectX function to draw my sprites:
D3DXVECTOR3 center(0.0f, 0.0f, 0.0f);
D3DXVECTOR3 position(static_cast<float>(x), static_cast<float>(y), z);
(my LPD3DXSPRITE object)->Draw((sprite texture pointer), NULL, &center, &position, D3DCOLOR_ARGB(a, r, g, b));
Is there a more efficient way to draw these pictures on the screen? Is there a way I can use less complex picture files (I'm using regular PNGs right now) to speed things up?
To sum it up: What is the most performance friendly way to draw sprites in DirectX? thanks!
The ID3DXSPRITE interface you are using is already pretty efficient. Make sure all your sprite draw calls happen in one batch if possible between the sprite begin and end calls. This allows the sprite interface to arrange the draws in the most efficient way.
For extra performance you can load multiple smaller textures into one larger texture and use texture coordinates to pick them out. This means textures don't have to be swapped as frequently. See:
http://nexe.gamedev.net/directknowledge/default.asp?p=ID3DXSprite
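Roughly what batching plus an atlas looks like with ID3DXSprite (a sketch; atlasTexture, blocks and atlasRectFor are placeholders for your own level data):
// One Begin/End per frame; the sort flag lets D3DX group draws by texture.
sprite->Begin(D3DXSPRITE_ALPHABLEND | D3DXSPRITE_SORT_TEXTURE);
for (const Block& b : blocks) {
    RECT src = atlasRectFor(b.tileId);                 // sub-rectangle of this sprite inside the atlas
    D3DXVECTOR3 pos(static_cast<float>(b.x), static_cast<float>(b.y), 0.0f);
    sprite->Draw(atlasTexture, &src, NULL, &pos, D3DCOLOR_ARGB(255, 255, 255, 255));
}
sprite->End();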
The file type you are using for the images does not matter as long as they are preloaded into textures. Make sure you load them all into textures once when the game/level is loading. Once you have loaded them into textures, it does not matter what format they were originally in.
If you still are not getting the performance you want, try using PIX to profile your application and find where the bottlenecks really are.
Edit:
This is too long to fit in a comment, so I will edit this post.
When I say swapping textures I mean binding them to a texture stage with SetTexture. Each time SetTexture is called there is a small performance hit as it changes the state of the texture stage. Normally this delay is fairly small, but can be bad if DirectX has to pull the texture from system memory to video memory.
ID3DXSprite will reorder the draws that are between Begin and End calls for you. This means SetTexture will typically only be called once for each texture, regardless of the order you draw them in.
It is often worth loading small textures into a large one. For example, if it were possible to fit all the small textures into one large one, then the texture stage could just stay bound to that texture for all draws. Normally this will give a noticeable improvement, but testing is the only way to know for sure how much it will help. It would look terrible, but you could just throw in any large texture and pretend it is the combined one to test what performance difference there would be.
I agree with dschaeffer, but would like to add that if you are using a large number of different textures, it may be better to pack them together into a single (or a few) larger textures and adjust the texture coordinates of the different sprites accordingly. Texture state changes cost a lot, and this may speed things up on older systems.
