WebGL - Is the gl.useProgram() an expensive call to make? - opengl-es

I'm wondering if this API method used to load a shader program is expensive or not to call?
Im considering making this call per object in my 3D scene.
gl.useProgram(shaderProgram);
thanks

this says it's better to use glUseProgram than is to glAttachShader+glLinkProgram
this says that changing shaders is always heavy, but glUseProgram is least heavy
this says it's generally efficient
this says that moderate and careful use won't become a bottleneck
this talks a bit about shader-switching performance optimization
So, conslusion: use it moderately. If you don't have much objects - great, if you do - try optimizing shader switching, or reusing shaders multiple times, or use same shader that utilizes branching somehow.
useProgram has got medium performance hit. It's not super-light, neither it's super-heavy like linkProgram and compileShader.
Hope this helps.

Related

Display list vs. VAO performance

I recently implemented functionality in my rendering engine to make it able to compile models into either display lists or VAOs based on a runtime setting, so that I can compare the two to each other.
I'd generally prefer to use VAOs, since I can make multiple VAOs sharing actual vertex data buffers (and also since they aren't deprecated), but I find them to actually perform worse than display lists on my nVidia (GTX 560) hardware. (I want to keep supporting display lists anyway to support older hardware/drivers, however, so there's no real loss in keeping the code for handling them.)
The difference is not huge, but it is certainly measurable. As an example, at a point in the engine state where I can consistently measure my drawing loop using VAOs to take, on a rather consistent average, about 10.0 ms to complete a cycle, I can switch to display lists and observe that cycle time decrease to about 9.1 ms on a similarly consistent average. Consistent, here, means that a cycle normally deviates less than ±0.2 ms, far less than the difference.
The only thing that changes between these settings is the drawing code of a normal mesh. It changes from the VAO code whose OpenGL calls look simply thusly...
glBindVertexArray(id);
glDrawElements(GL_TRIANGLES, num, GL_UNSIGNED_SHORT, NULL); // Using an index array in the VAO
... to the display-list code which looks as follows:
glCallList(id);
Both code paths apply other states as well for various models, of course, but that happens in the exact same manner, so those should be the only differences. I've made explicitly sure to not unbind the VAO unnecessarily between draw calls, as that, too, turned out to perform measurably worse.
Is this behavior to be expected? I had expected VAOs to perform better or at least equally to display lists, since they are more modern and not deprecated. On the other hand, I've been reading on the webs that nVidia's implementation has particularly well optimized display lists and all, so I'm thinking perhaps their VAO implementation might still be lagging behind. Has anyone else got findings that match (or contradict) mine?
Otherwise, could I be doing something wrong? Are there any known circumstances that make VAOs perform worse than they should, on nVidia hardware or in general?
For reference, I've tried the same differences on an Intel HD Graphics (Ironlake) as well, and there it turned out that using VAOs performed just as well as simply rendering directly from memory, while display lists were much worse than either. I wish I had AMD hardware to try on, but I don't.

Low Flash AS3 Performance

I'm working on a game coded in AS3 using the Alternativa3D 7.8 engine and it just doesn't have the FPS I was hoping to achieve with it and I'm trying to fully understand why. I get it that having 3D objects in a scene can be very taxing on performance but I'm using only a very limited number of 3d objects and each of those has a relatively small polygon count.
I'm wondering if there is something else like a memory leak causing this on top of the actual rendering of the scene.
I'd like to figure out a way to view how the performance is being distributed in my code to see if there are certain areas that are causing this. I usually only get about 10-15 FPS on my computer and I'd like to get that to around a constant 20-24 or higher if possible.
I don't think that this question should be downvoted necessarily, though it is a bit broad. OP is asking about general performance tips for AS3 applications.
It's true that we can't give him specific pointers without seeing his code, but we could still provide him with more general tips/tricks. Here's some analysis, pretty general:
I don't think your performance problems necessarily have anything to do with your 3D, though they might. The instant the game world comes on screen even the mouse movement is tremendously slowed, whereas the instant I pause it the framerate improves - which suggests to me that you are doing a lot of iteration and calculation on every frame.
I'd start with this: do you have any computationally intensive loops going on inside of your main game loop? For instance, I see that you're working with sea level as it effects landmass - are you doing something like calculating all of your water properties on every frame?
Having a lot of "3D" objects isn't necessarily a problem, because a 3D object is just a set of points. They're more intensive to position than 2d objects because you're including an additional dimension, but not so much more intensive that a few 3d objects would cause this kind of performance. I don't think that they are your problem (though I could be wrong).
Rather, it's what kind of calculations you're performing. Look for loops, figure out what you can comment out and instantly see better performance, and then once you've isolated it see what you can do about caching the outputs of those computations so that you don't have to recalc them on every frame.
Cheers,
mb

Performance comparision

In Opengl es, I come to know two ways to do light effects,
1> using light map
2> using stencil buffers
Which is more efficient way in term of performance comparision?
The answer is it depends. If you are heavily using stencilling, then your application may become fill rate limited. Doing a texture lookup on a light map can slow down your fragment/vertex shaders, also making your application fill rate limited. Generally though, light maps are more efficient, as they don't involve extra passes like most stencilling effects do. Moral of the story is, only bench marking will accuratly tell you what is more efficient.

Efficiency of branching in shaders

I understand that this question may seem somewhat ungrounded, but if someone knows anything theoretical / has practical experience on this topic, it would be great if you share it.
I am attempting to optimize one of my old shaders, which uses a lot of texture lookups.
I've got diffuse, normal, specular maps for each of three possible mapping planes and for some faces which are near to the user I also have to apply mapping techniques, which also bring a lot of texture lookups (like parallax occlusion mapping).
Profiling showed that texture lookups are the bottleneck of the shader and I am willing to remove some of them away. For some cases of the input parameters I already know that part of the texture lookups would be unnecessary and the obvious solution is to do something like (pseudocode):
if (part_actually_needed) {
perform lookups;
perform other steps specific for THIS PART;
}
// All other parts.
Now - here comes the question.
I do not remember exactly (that's why I stated the question might be ungrounded), but in some paper I recently read (unfortunately, can't remember the name) something similar to the following was stated:
The performance of the presented
technique depends on how efficient
the HARDWARE-BASED CONDITIONAL
BRANCHING is implemented.
I remembered this kind of statement right before I was about to start refactoring a big number of shaders and implement that if-based optimization I was talking about.
So - right before I start doing that - does someone know something about the efficiency of the branching in shaders? Why could branching give a severe performance penalty in shaders?
And is it even possible that I could only worsen the actual performance with the if-based branching?
You might say - try and see. Yes, that's what I'm going to do if nobody here is helps me :)
But still, what in the if case may be effective for new GPU's could be a nightmare for a bit older ones. And that kind of issue is very hard to forecast unless you have a lot of different GPU's (that's not my case)
So, if anyone knows something about that or has benchmarking experience for these kinds of shaders, I would really appreciate your help.
Few remaining brain cells that are actually working keep telling me that branching on the GPU's might be far not as effective as branching for the CPU (which usually has extremely efficient ways of branch predictions and eliminating cache misses) simply because it's a GPU (or that could be hard / impossible to implement on the GPU).
Unfortunately I am not sure if this statement has anything in common with the real situation...
If the condition is uniform (i.e. constant for the entire pass), then the branch is essentially free because the framework will essentially compile two versions of the shader (branch taken and not) and choose one of these for the entire pass based on your input variable. In this case, definitely go for the if statement as it will make your shader faster.
If the condition varies per vertex/pixel, then it can indeed degrade performance and older shader models don't even support dynamic branching.
Unfortunately, I think the real answer here is to do practical testing with a performance analyser of your specific case, on your target hardware. Particularly given that it sounds like you're at project optimisation stage; this is the only way to take into account the fact that hardware changes frequently and the nature of the specific shader.
On a CPU, if you get a mispredicted branch, you'll cause a pipeline flush and since CPU pipelines are so deep, you'll effectively lose something in the order of 20 or more cycles. On the GPU things a little different; the pipeline are likely to be far shallower, but there's no branch prediction and all of the shader code will be in fast memory -- but that's not the real difference.
It's difficult to know the exact details of everything that's going on, because nVidia and ATI are relatively tight-lipped, but the key thing is that GPUs are made for massively parallel execution. There are many asynchronous shader cores, but each core is again designed to run multiple threads. My understanding is that each core expects to run the same instruction on all it's threads on any given cycle (nVidia calls this collection of threads a "warp").
In this case, a thread might represent a vertex, a geometry element or a pixel/fragment and a warp is a collection of about 32 of those. For pixels, they're likely to be pixels that are close to each other on screen. The problem is, if within one warp, different threads make different decisions at the conditional jump, the warp has diverged and is no longer running the same instruction for every thread. The hardware can handle this, but it's not entirely clear (to me, at least) how it does so. It's also likely to be handled slightly differently for each successive generation of cards. The newest, most general CUDA/compute-shader friendly nVidias might have the best implementation; older cards might have a poorer implementation. The worse case is you may find many threads executing both sides of if/else statements.
One of the great tricks with shaders is learning how to leverage this massively parallel paradigm. Sometimes that means using extra passes, temporary offscreen buffers and stencil buffers to push logic up out of the shaders and onto the CPU. Sometimes an optimisation may appear to burn more cycles, but it could actually be reducing some hidden overhead.
Also note that you can explicitly mark if statements in DirectX shaders as [branch] or [flatten]. The flatten style gives you the right result, but always executes all in the instructions. If you don't explicitly choose one, the compiler can choose one for you -- and may pick [flatten], which is no good for your example.
One thing to remember is that if you jump over the first texture lookup, this will confuse the hardware's texture coordinate derivative math. You'll get compiler errors and it's best not to do so, otherwise you might miss out on some of the better texturing support.
In many cases the both branches could be calculated and mixed by condition as interpolator.
That approach works much faster than branch. Could be used on CPU also.
For instance:
...
vec3 c = vec3(1.0, 0.0, 0.0);
if (a == b)
c = vec3(0.0, 1.0, 0.0);
could be replaced by:
vec3 c = mix(vec3(1.0, 0.0, 0.0), vec3(0.0, 1.0, 0.0), (a == b));
...
Here's a real world performance benchmark on a kindle Fire:
In the fragment shader...
This runs at 20fps:
lowp vec4 a = vec4(0.0, 0.0, 0.0, 0.0);
if (a.r == 0.0)
gl_FragColor = texture2D ( texture1, TextureCoordOut );
This runs at 60fps:
gl_FragColor = texture2D ( texture1, TextureCoordOut );
I don't know about the if-based optimizations, but how about just creating all the permutations of the texture-lookups that you think you'll need, each its own shader, and just use the right shader for the right situation (depending on which texture lookups a particular model, or part of your model, needed). I think we did something like this on Bully for Xbox 360.

Site on OpenGL call performance

I'm searching for reliable data on OpenGL's functions performance. A site that could for example:
...answer me how much more efficient is using glInterleavedArrays compared to gl*Pointer based implementation with strides, or without them. If applicable, show the comparisions on nVidia vs. ATI cards vs. embedded systems.
...answer me how much of a boost is gained in using VBO's vs. non-buffered data in the cases of static, dynamic and stream data.
I'd like to find a site that has "no-bullshit" performance data, not just vague statements like "glInterleavedArrays are usually faster than direct gl*Pointer usage".
Is there such a dream-site? Or at least somewhere where I can get answers to the forementioned questions?
(yes, I know that nothing will beat hand-profiling, but the fact that something works faster on my machine, doesn't mean it's faster generally on all cards...)
It's more about application level benchmarking than measuring performance of individual features, but it might be possible to learn something from specviewperf, especially if it's possible to discover more about what OpenGL mode each benchmark uses to perform it's rendering. The benchmark seems to include some options to tweak usage of display lists, vertex arrays etc, but I don't think SPECs published results go into any analysis of the effects of changing these from the defaults. They don't seem to have any VBO coverage yet.

Resources