I plan to eliminate all glUniform calls from my GLSL shaders in order to save the cost of state switching. For that purpose, I plan to use a UBO that is bound to the shader permanently. Different draw calls use different parts of the UBO (it's basically an array). To tell a draw call which entry to use, I have to pass an integer to the vertex/fragment shaders. The problem is that, on the system I have to use, issuing even a single glUniform call causes an expensive state update, so I cannot use glUniform at all.
Do you know a solution that will work on GLES 3.1 and one that will work on GLES 2?
GLES doesn't have glMulti* calls yet, and base vertex is only available from 3.2 upwards as far as I know. And adding another vertex attribute may be costly.
Related
I'm currently rewriting a shader written in GLES30 for the GLES20 shader language.
I've hit a snag: the shader I need to convert calls textureLod, which samples the currently bound texture at a specific level of detail. The call is made in the fragment shader, but in GLES20 a lookup with an explicit level of detail is only allowed in the vertex shader.
I'm wondering: if I replace this with a call to texture2D, am I likely to compromise the function of the shader, or just reduce its performance? Every textureLod call in the original shader uses a level of detail of zero.
If you switch calls from textureLod to texture2D, you will lose control over which mip-level is being sampled.
If the texture being sampled only has a single mip-level, then the two calls are equivalent, regardless of the lod parameter passed to textureLod, because there is only one level that could be sampled.
If the original shader always samples the top mip level (level 0), it is unlikely that the change could hurt performance, as sampling lower mip-levels would more likely give better texture cache performance. If possible, you could have your sampled texture only include a top level to guarantee equivalence (unless the mip levels are required somewhere else). If this isn't possible, then the execution will be different. If the sample is used for 'direct' texturing, it is likely that the results will be fairly similar, assuming a nicely generated mip-chain. If it is used for other purposes (e.g. logic within the shader), then the divergence might be larger. It's difficult to predict without seeing the actual shader.
Also note that, if the texture sample is used within a loop or conditional, and the shader has been ported to or from DirectX HLSL at any point in its lifetime, the call to textureLod may be an artifact of HLSL not allowing gradient instructions within dynamic loops (the HLSL equivalent of texture2D is a gradient instruction, while the equivalent of textureLod is not). HLSL requires this even if the texture only has a single mip-level.
I'm making a WebGL game and eventually came up with a pretty convenient concept of object templates: game objects of the same kind (say, characters of the same race) use the same template (which means: buffers, attributes and shader program), and are instanced from that template by specifying a set of uniforms (which are, in fact, the most common difference between same-kind objects: model matrix, textures, bone positions, etc). To make independent objects with their own deep copy of the buffers, I just deep-copy and re-initialize the original template and start instantiating new objects from it.
But after that I started having doubts. Say, if I start using morphing on objects, by explicit editing of the vertices, this approach will require me to make a separate template for every object of such kind (otherwise, they would start morphing in exactly the same phase). Which is probably fine for this very case, 'cause I'll most likely need to recalculate normals and even texture coordinates, which means – most of the buffers.
But what if I'm missing some very common case of using attributes, say, blood decals, which will require me to update only a small piece of the buffer? In that case, it would be much more reasonable to have two buffers for each object: a common one that is shared by them all, and one for blood decals that is unique to each object. And, as blood is usually spilled on everything, this sounds pretty reasonable: we would save a lot of space by storing vertices, normals and such without unnecessary duplication.
I haven't tried implementing decals yet, so honestly not even sure if implementing them using vertex painting (textured or not) is the right choice. But I'm also pretty sure there are some commonly used attributes aside from vertices, normals and texture coordinates.
Here are some that I managed to come up with myself:
decals (probably better to be modelled as separate objects?)
bullet holes and such (same as decals maybe?)
Any thoughts?
UPD: as all this might sound confusing, I want to clarify: I do understand that using as few buffers as possible is a good thing, and this is exactly why I'm trying to use this template concept. My question is: in which cases is using a single buffer and a single element buffer (both shared between similar objects) for a template going to stab me in the back?
Keeping a giant chunk of data that won't change on the card is incredibly useful for saving bandwidth. Additionally, you probably won't be directly changing the vertex positions once they are on the card. Instead you will probably morph them in the vertex shader with passed-in uniforms, through skeletal animation. Read about it here: Skeletal Animation
Do keep in mind, though, that in keyframe animation with meshes, you would keep a bunch of buffers on the card, each holding a different keyframe pose of the animation. You would then bind whichever two keyframes you want to interpolate between as attributes and blend between them (you can have more than two). Keyframe Animation
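To make the keyframe blend concrete, here is a minimal CPU-side sketch in C of the per-component interpolation a vertex shader would perform on two pose attributes. The function names and the flat float-array layout are my own assumptions for illustration, not something from the linked material.

```c
#include <assert.h>
#include <stddef.h>

/* Linear blend of one component; this is what GLSL's mix(a, b, t) computes. */
float blend_component(float a, float b, float t)
{
    return a + (b - a) * t;
}

/* CPU reference for the blend a vertex shader would do per vertex.
 * pose_a and pose_b are two keyframe poses stored as flat float arrays
 * (hypothetical layout: x, y, z, x, y, z, ...). t is in [0, 1]. */
void blend_keyframes(const float *pose_a, const float *pose_b, float t,
                     float *out, size_t component_count)
{
    assert(t >= 0.0f && t <= 1.0f);
    for (size_t i = 0; i < component_count; ++i) {
        out[i] = blend_component(pose_a[i], pose_b[i], t);
    }
}
```

On the GPU both poses stay in their static buffers and only the blend factor t changes per frame, so no vertex data has to be re-uploaded.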
Additionally, with the introduction of Transform Feedback (no, you don't get to use it in WebGL: it became core in OpenGL 3.0, and WebGL is based on OpenGL ES 2.0, which is based on OpenGL 2.0), you can start keeping calculated data GPU side. In other words, you can do a giant particle system simulation in the vertex or geometry shader, store the calculated data into another buffer, and then use that buffer in the next frame without a round trip from the GPU to the CPU. Read about it here: Transform Feedback and here: Transform Feedback how to
In general, you don't want to touch buffers once they are on the card, especially every frame. Instead load several and use pointers to that data in shaders as attributes.
This is for an OpenGL ES 2.0 game on Android, though I suspect the right answer is generic to any opengl situation.
TL;DR - is it better to send N data to the gpu once and then make K draw calls with it; or send K*N data to the gpu once, and make 1 draw call?
More Details I'm wondering about best practices for my situation. I have a dynamic mesh whose vertices I recompute every frame - think of it as a water surface - and I need to project these vertices onto K different quads in my game. (In each case the projection is slightly different; sparing details, you could imagine them as K different mirrors surrounding the mesh.) K is in the order of 10-25; I'm still figuring it out.
I can think of two broad options:
(1) Bind the mesh as is, and call draw K different times, either changing a uniform for shaders or messing with the fixed function state to render to the correct quad in place (on the screen) or to different segments of a texture (which I can later use when rendering the quads to achieve the same effect).
(2) Duplicate all the vertices in the mesh K times, essentially making a single vertex stream with K meshes in it, and add an attribute (or few) indicating which quad each mesh clone is supposed to project onto (and how to get there), and use vertex shaders to project. I would make one call to draw, but send K times as much data.
The Question: of those two options, which is generally better performance wise?
(Additionally: is there a better way to do this?
I had considered a third option, where I rendered the mesh details to a texture, and created my K-clone geometry as a sort of dummy stream, which I could bind once and for all, that looked up in a vertex shader into the texture for each vertex to find out what vertex it really represented; but I've been told that texture support in vertex shaders is poor or prohibited in OpenGL ES 2.0 and would prefer to avoid that route.)
There is no perfect answer to this question, though I would suggest you think about the nature of real-time computer graphics and the OpenGL pipeline. Although "the GL" is required to produce results that are consistent with in-order execution, the reality is that GPUs are highly parallel beasts. They employ lots of tricks that work best if you actually have many unrelated tasks going on at the same time (some even split the whole pipeline up into discrete tiles). GDDR memory, for instance, is really high latency, so for efficiency GPUs need to be able to schedule other jobs to keep the stream processors (shader units) busy while memory is fetched for a job that is just starting.
If you are recomputing parts of your mesh each frame, then you will almost certainly want to favor more draw calls over massive CPU->GPU data transfers every frame. Saturating the bus with unnecessary data transfers plagues even PCI Express hardware (the transfer is far slower than the overhead that several additional draw calls would ever add), and it can only get worse on embedded OpenGL ES systems. Having said that, there is no reason you could not simply use glBufferSubData (...) to stream in only the affected portions of your mesh and continue to draw the entire mesh in a single draw call.
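As a rough sketch of the partial update, the helper below (C, names are hypothetical) computes the byte offset and size that would be handed to glBufferSubData for a run of changed vertices. It assumes a tightly packed buffer where every vertex occupies stride_bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Byte range covering vertices [first_vertex, first_vertex + vertex_count). */
typedef struct {
    size_t offset; /* value for glBufferSubData's offset parameter */
    size_t size;   /* value for glBufferSubData's size parameter */
} ByteRange;

ByteRange dirty_range(size_t first_vertex, size_t vertex_count,
                      size_t stride_bytes)
{
    ByteRange r;
    assert(stride_bytes > 0);
    r.offset = first_vertex * stride_bytes;
    r.size = vertex_count * stride_bytes;
    return r;
}

/* Intended use with a bound GL_ARRAY_BUFFER (cpu_vertices is assumed to be
 * the CPU copy of the whole mesh as an array of bytes):
 *
 *   ByteRange r = dirty_range(first, count, stride);
 *   glBufferSubData(GL_ARRAY_BUFFER, (GLintptr)r.offset,
 *                   (GLsizeiptr)r.size, cpu_vertices + r.offset);
 */
```

Only the changed run crosses the bus; the draw call still covers the whole buffer.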
You might get better cache coherency if you split up (or partition the data within) the buffer and/or the draw calls, depending on your actual use case. The only way to decisively tell which is going to work better in your case is to profile your software on your target hardware. But all of this fails to look at the bigger picture, which is: "Why am I doing this on the CPU?!"
It sounds like what you really want is simply vertex instancing. If you can re-work your algorithm to work completely in vertex shaders by passing instance IDs you should see a massive improvement over all of the solutions I have seen you propose so far (true instancing is actually somewhere between what you described in solutions 1 and 2) :)
The actual concept of instancing is very simple and will give you benefits whether your particular version of the OpenGL API supports it at the API level or not (you can always implement it manually with vertex attributes and extra vertex buffer data). The thing is, you would not have to duplicate your data at all if you implement instancing correctly. The extra data necessary to identify each individual vertex is static, and you can always change a shader uniform and make an additional draw call (this is probably what you will have to do with OpenGL ES 2.0, since it does not offer glDrawElementsInstanced) without touching any vertex data.
You certainly will not have to duplicate your vertices K*N times, your buffer space complexity would be more like O (K + K*M), where M is the number of new components you had to add to uniquely identify each vertex so that you could calculate everything on the GPU. For "instance," you might need to number each of the vertices in your quad 1-4 and process the vertex differently in your shader depending on which vertex you're processing. In this case, the M coefficient is 1 and it does not change no matter how many instances of your quad you need to dynamically calculate each frame; N would determine the number of draw calls in OpenGL ES 2.0, not the size of your data. None of this additional storage space would be necessary in OpenGL ES 2.0 if it supported gl_VertexID :(
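A minimal sketch of the "extra static data" idea in C, assuming the simplest possible identifier: one float per vertex holding its index, which stands in for the gl_VertexID that OpenGL ES 2.0 lacks. The function name is hypothetical, and the ids start at 0 rather than 1 to mirror gl_VertexID.

```c
#include <assert.h>
#include <stddef.h>

/* Fill a static attribute stream with 0, 1, 2, ... so the vertex shader can
 * tell vertices apart. Floats are used because GLES 2.0 vertex attributes
 * are floating point. Upload once; it never changes afterwards. */
void fill_vertex_ids(float *ids, size_t vertex_count)
{
    assert(ids != NULL || vertex_count == 0);
    for (size_t i = 0; i < vertex_count; ++i) {
        ids[i] = (float)i;
    }
}

/* Per frame, manual instancing then looks roughly like this
 * (u_instance is a hypothetical uniform location):
 *
 *   for (k = 0; k < instance_count; ++k) {
 *       glUniform1f(u_instance, (float)k);
 *       glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
 *   }
 */
```

The id buffer's size depends only on the vertex count of one instance, not on how many instances are drawn.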
Instancing is the best way to make effective use of the highly-parallel GPU and avoid CPU/GPU synchronization and slow bus transfers. Even though OpenGL ES 2.0 does not support instancing in the API sense, multiple draw calls using the same vertex buffer, where the only thing you change between calls is a couple of shader uniforms, is often preferable to computing your vertices on the CPU and uploading new vertex data every frame, or to having your vertex buffer's size depend directly on the number of instances you intend to draw (yuck). You'll have to try it out and see what your hardware likes.
Instancing would be what you are looking for, but unfortunately it is not available in OpenGL ES 2.0. I would be in favor of sending all the vertices to the GPU and making one draw call, if all your assets can fit on the GPU. In my experience, reducing draw calls from 100+ to 1 took performance from 15 fps to 60 fps.
I'm adding an OpenGL renderer to my 2D game engine and I want to know whether there is a way to apply an mvp matrix only to part of the vertices in a single draw call?
I'm planning to group draw calls by texture, so I'll pass a buffer of many vertices and texcoords. Now I want to apply different rotation angles to different quads. Is there a way to accomplish this in the shader, or should I give up on the mvp matrix in the shader and do the same thing on the CPU?
EDIT: What about adding 3 float attributes (rotation and rot_center.xy) per vertex?
Which performs better:
(1) doing CPU rotation?
(2) providing 3 more floats per vertex
(3) separating draw calls?
Is there any other option?
Here is a possibility:
(1) Do the rotation in the vertex shader. Pass in the information (angle?) needed to create the rotation matrix as a vertex attribute.
(2) Pass in a vertex attribute (ubyte) that is effectively a per-vertex boolean flag. The rotation in (1) will be executed only if the flag is set.
Not sure if the above will work for you from a performance/storage perspective.
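For what it's worth, here is a sketch in C of the vertex layout this implies, using the three extra floats from the question's edit (angle plus rotation center). The struct and function names are made up for illustration; the shader side is only outlined in the comment.

```c
#include <assert.h>
#include <stddef.h>

/* One interleaved vertex: the usual position/texcoord plus 3 extra floats. */
typedef struct {
    float x, y;    /* unrotated position */
    float u, v;    /* texture coordinate */
    float angle;   /* rotation in radians, identical for a quad's 4 vertices */
    float cx, cy;  /* rotation center, identical for a quad's 4 vertices */
} SpriteVertex;

/* Write the 4 vertices (triangle-strip order) of a quad centered at (cx, cy).
 * The vertex shader would then compute, per vertex:
 *   d = position - center;
 *   rotated = center + vec2(d.x * cos(angle) - d.y * sin(angle),
 *                           d.x * sin(angle) + d.y * cos(angle));
 * and apply the shared mvp matrix to the result. */
void write_quad(SpriteVertex *out, float cx, float cy,
                float half_w, float half_h, float angle)
{
    static const float corner[4][2] = {
        {-1.0f, -1.0f}, {1.0f, -1.0f}, {-1.0f, 1.0f}, {1.0f, 1.0f}
    };
    assert(out != NULL);
    for (size_t i = 0; i < 4; ++i) {
        out[i].x = cx + corner[i][0] * half_w;
        out[i].y = cy + corner[i][1] * half_h;
        out[i].u = 0.5f * (corner[i][0] + 1.0f);
        out[i].v = 0.5f * (corner[i][1] + 1.0f);
        out[i].angle = angle;
        out[i].cx = cx;
        out[i].cy = cy;
    }
}
```

This costs 12 extra bytes per vertex, which is the bandwidth trade-off to weigh against splitting the draw calls.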
I think that, while it is a good thing to group draw calls for many different performance reasons, changing your code to satisfy a requirement as basic as rotation is not a good idea.
Draw batching is a good thing but, if you are forced to keep an additional attribute (because you cannot do it with uniforms for sure, as you wouldn't have the information for each single entity), it is not worth it.
An additional attribute means much more memory bandwidth usage, which is usually the main performance killer on today's systems.
Draw batching, on the other hand, is important but not always critical. It depends on many factors, such as:
the GPU OpenGL driver optimization
The GPU tiles configuration
The number of shapes/draw calls we are talking about (if you have 20 quads on the screen, why should you bother with batching? :) )
In other words, it is often much more convenient to drop extreme batching in favor of ease/maintainability and to avoid fancy solutions for simple requirements such as rotation.
I hope this helps in some way.
Use two different objects, that is all!
There is no other workaround for rotating part of an object.
Example:
A game with a tank, where you want to rotate the turret and the rest of the body separately. As in your case, these two are treated as separate objects.
I'm using a ParticleSystem with PointSprites (inspired by the Cocos2D source). But I wonder how to rebuild the functionality for OpenGL ES 2.0
glEnable(GL_POINT_SPRITE_OES);
glEnableClientState(GL_POINT_SIZE_ARRAY_OES);
glPointSizePointerOES(GL_FLOAT,sizeof(PointSprite),(GLvoid*) (sizeof(GL_FLOAT)*2));
glDisableClientState(GL_POINT_SIZE_ARRAY_OES);
glDisable(GL_POINT_SPRITE_OES);
These generate BAD_ACCESS when using an OpenGL ES 2.0 context.
Should I simply go with 2 TRIANGLES per PointSprite? But that's probably not very efficient (overhead for the extra vertices).
EDIT:
So, my new problem with the suggested solution from:
https://gamedev.stackexchange.com/questions/11095/opengl-es-2-0-point-sprites-size/15528#15528
is how to pass many different sizes in a batch call. I thought of using an attribute instead of a uniform, but then I would always need to pass a point size to my shaders, even if I'm not drawing GL_POINTS. So, maybe a second shader (a shader only for GL_POINTS)?! I'm not aware of the overhead of switching shaders every frame in the draw routine (because if the particle system is used, I naturally also want to render regular GL_TRIANGLES without a point size)... Any ideas on this?
As I already commented, the approach described here is what you need: https://gamedev.stackexchange.com/questions/11095/opengl-es-2-0-point-sprites-size/15528#15528
As for which approach to take: you can either use different shaders for different types of drawables in your application, or add a boolean uniform to your shader that enables or disables writing gl_PointSize in the shader code. It's up to you. Keep in mind that changing the shader program is one of the most time-costly operations, so drawing objects of the same type in a batch is better in that case. I'm not sure whether an if statement in your shader code would have a large performance impact.
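If you go with separate shaders, batching by program could look roughly like the C sketch below. The Drawable struct and function names are hypothetical; the idea is just to sort the frame's draw list by program handle so that each program is bound once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned program; /* shader program handle (hypothetical field) */
    unsigned mesh;    /* whatever identifies the geometry to draw */
} Drawable;

static int by_program(const void *a, const void *b)
{
    const Drawable *da = a;
    const Drawable *db = b;
    return (da->program > db->program) - (da->program < db->program);
}

/* Reorder the draw list so drawables sharing a program are adjacent. */
void sort_by_program(Drawable *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1) {
        qsort(items, count, sizeof *items, by_program);
    }
}

/* Number of program binds needed when drawing the list in order. */
size_t count_program_switches(const Drawable *items, size_t count)
{
    size_t switches = 0;
    for (size_t i = 0; i < count; ++i) {
        if (i == 0 || items[i].program != items[i - 1].program) {
            ++switches; /* real code would call glUseProgram here */
        }
    }
    return switches;
}
```

With a point-sprite program and a triangle program, a sorted list needs two binds per frame no matter how the particles and meshes were submitted.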