OpenGL ES 2.0: glUseProgram vs glUniform performance

Which is faster, a single call to glUseProgram, or sending e.g. 6 or so floats via glUniform (batched or separately), and by approximately how much?

Can you describe in more detail the scenario where you think this affects the performance of the rendering pipeline? They offer completely different functionalities and I don't see why you would care about the performance of glUseProgram vs glUniform.
Now let's analyze what happens when you call these functions to get an idea of their cost.
When you call glUseProgram, it changes several OpenGL rendering states, because the shaders attached to the new program object are about to be used. The specification says that the vertex and fragment programs are installed in the processors when you invoke this function. That alone seems costly enough to overshadow the cost of glUniform. Also, when new vertex and fragment programs are installed, additional states of the rendering pipeline are changed to accommodate the number of texture units and the data layout used by the programs.
glUniform copies data from one memory location to another to set the value of a uniform variable. The worst case would be copying matrices, which still seems cheaper than glUseProgram.
In the end, though, it all depends on the amount of data you transfer with glUniform, on the underlying implementation of glUseProgram (the driver could optimize it heavily so that it costs very little), and on whether your engine is smart enough to group the geometry that uses the same program and draw it without changing state.
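To make the last point concrete, here is a minimal sketch of grouping draws by program. It does not call OpenGL at all; it only counts how many times the bound program would change. The draw list, the program names and both helper functions are hypothetical and exist purely for illustration.

```python
from operator import itemgetter


def count_program_switches(draws):
    """Count how often the bound program changes while walking the draws in order."""
    switches = 0
    current = None
    for program, _mesh in draws:
        if program != current:
            # In a real renderer this is where glUseProgram would be called.
            switches += 1
            current = program
    return switches


def sort_by_program(draws):
    """Stable sort so that draws sharing a program end up adjacent."""
    return sorted(draws, key=itemgetter(0))


# Hypothetical frame: (program name, mesh name) pairs in submission order.
draws = [
    ("lit", "rock"),
    ("water", "lake"),
    ("lit", "tree"),
    ("water", "river"),
    ("lit", "house"),
]

unsorted_switches = count_program_switches(draws)
sorted_switches = count_program_switches(sort_by_program(draws))
```

Whether saving those switches matters on a given device is still something only a benchmark can tell you.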

Related

Can I avoid texture gradient calculations in webgl?

We have a webgl/three.js application that makes extensive use of texture buffers for passing data between passes and for storing arrays of data. None of these has any use for mipmaps. We are easily able to prevent mipmap generation: at the three.js level we set min and mag filters to NearestFilter, and set generateMipmaps false.
However, the shaders do not know at compile time that there is no mipmapping. When compiled using ANGLE we get a lot of warning messages:
warning X4121: gradient-based operations must be moved out of flow control to prevent divergence. Performance may improve by using a non-gradient operation
I have recoded so that the flow around such lookups is (optionally) avoided.
On my Windows/NVidia machine using the conditional flows improves performance and does not cause any visual issues (but does cause the messages).
I don't want the texture lookups to be gradient-based operations. What I would like to do is to write the shaders in such a way that they know at compile time that there is no decision to be made; which should (marginally) improve performance and also make the messages go away. However, I cannot see any way to do this in GLSL for GLES 2 (as used by webgl). It can be done in later versions with textureLodOffset() and various other ways. The only control in level 2 I can see is the bias option on texture2D(), but that is a bias not an absolute value and so does not resolve the issue. So, finally ...
Question: Do you know any way to prevent LOD calculation in WebGL-level GLSL shaders?

You might try ensuring that:
- gl_FragCoord is used instead of a user varying
- NEAREST is set before texImage2D, instead of after

Which GLSL Multi Colour Linear/Radial Gradients Strategy to use?

I'm developing using OpenGL ES 2 & GLSL and I'm stuck on how to approach multi coloured / fractioned gradients ( linear and radial ).
I don't know which approach is the best practice:
- Get a texture of the gradient colours & sample this in the fragment shader ( essentially working with a regular texture ).
- Generate a texture of the gradient programmatically first & sample this in the fragment shader as above ( no need for PNGs etc of the gradient ) - caching this texture to save regeneration.
- Use interpolation in the fragment shader to calculate the fragment value by fragment position - this looks like I'd have to use multiple ifs, a loop, stuff you don't want executed per fragment.
- Some other strategy I haven't conceived of.
I know this question is a bit on the subjective side, but having looked around online for information I've not found anything concrete about how to proceed...

Well, I can tell you how to proceed, but you may not like the answer. ;) The two main approaches are sampling a texture, or doing shader calculations. To decide which one is more efficient in your case, you need to implement both, and start benchmarking. There are way too many factors influencing the performance of each to give a generic answer.
One of the major factors is of course how complex your calculations are. But modern GPUs have very high raw performance for pure calculations. Not quite as much for the mobile GPUs you're most likely using since you're asking about ES, but even the latest mobile GPUs have become quite powerful. Branches aren't free, but not necessarily as harmful as you might expect.
On the other hand, texture sampling looks like a single operation in the shader, but based on that alone you should not assume that it's automatically faster than executing a bunch of computations. Texture sampling performance can be limited by many factors, including throughput of the texture sampling hardware units, memory bandwidth, cache hit rates, etc. Particularly if your textures need to be fairly large to give you the necessary precision, memory bandwidth can hurt you, and accessing memory on a mobile device consumes significant power. Also, just the additional memory usage is undesirable since you mostly deal with very constrained amounts of memory.
Of course the performance characteristics can vary greatly between different GPUs. So if you want to make reliable conclusions, you need to benchmark on a variety of devices.
For the approach where you implement the computations in the shader, make sure that it is as optimal as it can be. Avoid branches where reasonably possible, or at least benchmark various options to see how much the branches hurt performance. If there are parts of the computation that are the same for each fragment, pre-compute the values and pass them into the shader. Replace expensive operations by cheaper ones where possible. For example, instead of dividing by a uniform value, pass in the inverse as a uniform, and use a multiplication instead. Use vector operations where possible.
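As a rough illustration of the computed approach, the sketch below evaluates a multi-stop gradient with a chain of clamp and mix steps instead of if/else, and multiplies by precomputed inverse segment widths instead of dividing per fragment. It is written in Python only to show the arithmetic; in a shader the same steps would map to clamp() and mix() with the stops passed as uniforms. The function names and the stop layout are assumptions, not something from the question.

```python
def clamp01(x):
    """Clamp x to the range [0, 1], like GLSL clamp(x, 0.0, 1.0)."""
    return max(0.0, min(1.0, x))


def mix(a, b, t):
    """Component-wise linear interpolation, like GLSL mix(a, b, t)."""
    return tuple(ai + (bi - ai) * t for ai, bi in zip(a, b))


def precompute_inverse_widths(positions):
    """Done once on the CPU: 1 / width of each segment between adjacent stops."""
    return [1.0 / (positions[i + 1] - positions[i]) for i in range(len(positions) - 1)]


def gradient_color(t, positions, colors, inv_widths):
    """Evaluate the gradient at t by chaining one clamped mix per segment."""
    color = colors[0]
    for i in range(len(inv_widths)):
        # Multiply by the precomputed inverse instead of dividing.
        local = clamp01((t - positions[i]) * inv_widths[i])
        color = mix(color, colors[i + 1], local)
    return color


# Hypothetical three-stop gradient: red -> green -> blue.
positions = [0.0, 0.5, 1.0]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
inv_widths = precompute_inverse_widths(positions)

quarter = gradient_color(0.25, positions, colors, inv_widths)
three_quarters = gradient_color(0.75, positions, colors, inv_widths)
```

The loop has a fixed trip count, so a shader compiler can unroll it. Whether this beats a texture lookup is exactly the thing you would need to benchmark.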

When to store quaternion vs matrix in static and dynamic objects (data structure design)

My question is about design and possible suggestions for the following scenario:
I am writing a 3d visualizer. For my renderable objects I would like to store the minimum data possible (so quaternions are naturally nice for rotation).
At some point I must extract a Matrix for rendering which requires computation and temporary storage on every frame update (even for objects that do not change spatially).
Given that many objects remain static and don't need to be rotated locally would it make sense to store the matrix instead and thereby avoid the computation for each object each frame? Is there any best practice approach to this perhaps from a game engine design point of view?
I am currently a bit torn between storing the two extremes of either position+quaternion or 4x3/4x4 matrix. Looking at openframeworks (not necessarily trying to achieve the same goal as me), they seem to do a hybrid where they store a quaternion AND a matrix (matrix always reflects the quaternion) so its always ready when needed but needs to be updated along with every change to the quaternion.

The most compact storage requires only 3 scalars, so Euler angles or exponential maps (Rodrigues) can be used. Quaternions are a good compromise between compactness and the speed of conversion to a matrix.
From a design point of view, there is a good rule: "make all design decisions as LATE as possible". In your case, just encapsulate (isolate) the rotation (transformation) representation, so that in the future you can change the physical storage of the data in its different states (file, memory, rendering and more). It also enables platform-specific optimizations, such as keeping the data on the GPU or the CPU.
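One possible shape for such an encapsulation is sketched below: callers only see set_position, set_rotation and matrix(), while the class stores position plus quaternion and rebuilds a cached matrix lazily, so static objects pay for the conversion once. The class name, the (w, x, y, z) ordering, the 3x4 row-major layout and the rebuild_count counter are all assumptions made for the example.

```python
import math


def quaternion_to_rows(w, x, y, z):
    """Return the 3x3 rotation matrix (row-major) of a unit quaternion."""
    return (
        (1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)),
        (2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)),
        (2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)),
    )


class Transform:
    """Stores position + quaternion; the matrix is a lazily rebuilt cache."""

    def __init__(self, position=(0.0, 0.0, 0.0), rotation=(1.0, 0.0, 0.0, 0.0)):
        self._position = tuple(position)
        self._rotation = tuple(rotation)  # (w, x, y, z), assumed unit length
        self._matrix = None               # None means "cache is stale"
        self.rebuild_count = 0            # only here to make the caching visible

    def set_position(self, position):
        self._position = tuple(position)
        self._matrix = None

    def set_rotation(self, rotation):
        self._rotation = tuple(rotation)
        self._matrix = None

    def matrix(self):
        """Return a 3x4 row-major matrix, rebuilding it only after a change."""
        if self._matrix is None:
            rows = quaternion_to_rows(*self._rotation)
            self._matrix = tuple(row + (t,) for row, t in zip(rows, self._position))
            self.rebuild_count += 1
        return self._matrix


# A static object rotated 90 degrees about the z axis.
half = math.sqrt(0.5)
node = Transform(position=(1.0, 2.0, 3.0), rotation=(half, 0.0, 0.0, half))
first = node.matrix()
second = node.matrix()  # served from the cache, no recomputation
```

Because the storage is hidden behind the accessors, switching to a stored matrix only, or to a quaternion only, later on would not touch the calling code.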

Been there.
First: keep in mind the omnipresent struggle of time against space (in computer science, processing time against memory requirements).
You said that you want to keep the minimum information possible at first (space), and then talked about a temporary matrix reflecting the quaternions, which is more of a time concern.
If you accept a tip, I would go for the matrices. They are generally the standard for 3D graphics performance-wise, and their size easily becomes irrelevant next to the object data itself.
Just to give you an idea: on most GPUs, transforming a vector by the identity matrix (no change) is actually faster than checking whether it needs a transformation and then doing nothing.
As for engines, I can't think of one that does not apply the transformations to every vertex every frame. Even if the objects stay in place, their positions still have to go through the view and projection matrices.
(Does this answer your question? Maybe I got you wrong.)

DrawPrimitives performance

I want to draw single faces instead of xna models because it's too slow.
But I don't know what the difference is between
- DrawPrimitives
- DrawUserPrimitives
- DrawIndexedPrimitives
- DrawUserIndexedPrimitives
Which one is the fastest method? And what are the indices good for?

The simple answer to your question is that the "User" versions are a fair bit slower on the CPU because they have to transfer vertex data to the GPU (via the driver and the bus) each time they are called.
The non-User versions use vertex and index buffers that already exist on the GPU (you put them there at load time). They have considerably less data to transfer, so they are faster.
The "User" and "Indexed" versions will also each have a performance impact on the GPU. This impact is relatively tiny. Generally speaking you don't need to worry about it.
The User versions exist because they are faster when your data changes each frame. There is also DynamicVertexBuffer which can be used with the non-User version of the draw functions. I believe it is slightly faster than the User methods in cases where you can pre-allocate the buffer at the desired size.
The Indexed versions allow you to select vertices out of your vertex buffer using an index buffer (so triangles that you draw can choose vertices at any position in the vertex buffer). The alternative is that your vertex buffer is simply interpreted as a sequential list of triangle vertices (based on PrimitiveType). The main reason for the existence of index buffers is to remove the need for duplicate vertices in your vertex buffer (which would require additional memory and processing on the GPU).
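To show what the index buffer buys you, here is a small sketch that is not tied to XNA: it takes a sequential triangle list and produces a deduplicated vertex list plus indices into it. The function name and the 2D quad data are made up for the example.

```python
def build_index_buffer(triangle_vertices):
    """Turn a sequential triangle list into (unique vertices, indices)."""
    unique = []   # plays the role of the vertex buffer
    lookup = {}   # vertex -> position in the vertex buffer
    indices = []  # plays the role of the index buffer
    for vertex in triangle_vertices:
        if vertex not in lookup:
            lookup[vertex] = len(unique)
            unique.append(vertex)
        indices.append(lookup[vertex])
    return unique, indices


# A quad drawn as two triangles: 6 sequential vertices, 2 of them duplicates.
quad = [(0, 0), (1, 0), (1, 1),
        (0, 0), (1, 1), (0, 1)]

vertices, indices = build_index_buffer(quad)
```

Here six vertices shrink to four plus six small indices; on real meshes, where a vertex is typically shared by several triangles, the saving is larger.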
BUT...
XNA's Model class internally uses DrawIndexedPrimitives. Not only that, but it uses it correctly (ie: it doesn't draw single faces - but as many as it can at once - for the best performance). So if you are finding that it is slow, then your problem lies elsewhere.
I suggest trying to diagnose the reason why your game is performing poorly, before trying to select a "solution". Maybe ask for help doing that in a question here (or on https://gamedev.stackexchange.com/).

Draw everything in one go if you can. Instanced drawing will always be better, but it requires you to supply all the textures at once! In my case, for example, I like to draw instanced objects with only one texture: all the trees, all the ground, all the buildings, etc.

Improving raytracer performance

I'm writing a comparatively straightforward raytracer/path tracer in D (http://dsource.org/projects/stacy), but even with full optimization it still needs several thousand processor cycles per ray. Is there anything else I can do to speed it up? More generally, do you know of good optimizations / faster approaches for ray tracing?
Edit: this is what I'm already doing.
- Code is already running highly parallel
- Temporary data is structured in a cache-efficient fashion as well as aligned to 16b
- Screen divided into 32x32-tiles
- Destination array is arranged in such a way that all subsequent pixels in a tile are sequential in memory
- Basic scene graph optimizations are performed
- Common combinations of objects (plane-plane CSG as in boxes) are replaced with preoptimized objects
- Vector struct capable of taking advantage of GDC's automatic vectorization support
- Subsequent hits on a ray are found via lazy evaluation; this prevents needless calculations for CSG
- Triangles neither supported nor priority. Plain primitives only, as well as CSG operations and basic material properties
- Bounding is supported

The typical first order improvement of raytracer speed is some sort of spatial partitioning scheme. Based only on your project outline page, it seems you haven't done this.
Probably the most usual approach is an octree, but the best approach may well be a combination of methods (e.g. spatial partitioning trees and things like mailboxing). Bounding box/sphere tests are a quick, cheap and nasty approach, but you should note two things: 1) they don't help much in many situations and 2) if your objects are already simple primitives, you aren't going to gain much (and might even lose). A regular grid for spatial partitioning is easier to implement than an octree, but it will only work really well for scenes that are somewhat uniformly distributed (in terms of surface locations).
A lot depends on the complexity of the objects you represent, your internal design (i.e. do you allow local transforms, referenced copies of objects, implicit surfaces, etc), as well as how accurate you're trying to be. If you are writing a global illumination algorithm with implicit surfaces the tradeoffs may be a bit different than if you are writing a basic raytracer for mesh objects or whatever. I haven't looked at your design in detail so I'm not sure what, if any, of the above you've already thought about.
Like any performance optimization process, you're going to have to measure first to find where you're actually spending the time, then improve things (algorithmically by preference, then by code bumming if necessary).
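For the regular grid mentioned above, the build step can be sketched in a few lines. This assumes the scene consists of spheres given as (cx, cy, cz, radius) tuples and that each sphere is registered in every cell its bounding box overlaps; the function name and the scene are hypothetical, and traversal of the grid along a ray is not shown.

```python
import math
from collections import defaultdict
from itertools import product


def build_uniform_grid(spheres, cell_size):
    """Map each grid cell (i, j, k) to the indices of spheres overlapping it."""
    grid = defaultdict(list)
    for index, (cx, cy, cz, radius) in enumerate(spheres):
        ranges = []
        for c in (cx, cy, cz):
            # Cell range covered by the sphere's bounding box on this axis.
            lo = math.floor((c - radius) / cell_size)
            hi = math.floor((c + radius) / cell_size)
            ranges.append(range(lo, hi + 1))
        for cell in product(*ranges):
            grid[cell].append(index)
    return grid


# Hypothetical scene: three small spheres along the x axis.
spheres = [
    (0.5, 0.5, 0.5, 0.25),  # entirely inside cell (0, 0, 0)
    (2.5, 0.5, 0.5, 0.25),  # entirely inside cell (2, 0, 0)
    (1.0, 0.5, 0.5, 0.25),  # straddles cells (0, 0, 0) and (1, 0, 0)
]
grid = build_uniform_grid(spheres, cell_size=1.0)
```

A ray then only needs to test the objects listed in the cells it passes through, which is where the saving comes from when the scene is evenly distributed.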

One thing I learned with my ray tracer is that a lot of the old rules don't apply anymore. For example, many ray tracing algorithms do a lot of testing to get an "early out" of a computationally expensive calculation. In some cases, I found it was much better to eliminate the extra tests and always run the calculation to completion. Arithmetic is fast on a modern machine, but a missed branch prediction is expensive. I got something like a 30% speed-up on my ray-polygon intersection test by rewriting it with minimal conditional branches.
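As an illustration of the "always run the calculation to completion" idea, here is a sketch of the well-known slab test for a ray against an axis-aligned box, written with min/max instead of early-out branches. Note that this is a ray-box test, not the ray-polygon test mentioned above, and the function names are invented for the example. It assumes the ray origin never lies exactly on a slab plane while the direction component on that axis is zero, since that case would produce a NaN.

```python
import math


def safe_inverse(d):
    """Reciprocal of a direction component; a zero component maps to +/- infinity."""
    return math.copysign(math.inf, d) if d == 0.0 else 1.0 / d


def ray_hits_box(origin, inv_dir, box_min, box_max):
    """Slab test: intersect the ray's parameter intervals on all three axes."""
    t_near = 0.0        # start at the ray origin, so boxes behind the ray are rejected
    t_far = math.inf
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0 = (lo - o) * inv
        t1 = (hi - o) * inv
        # No early exit: every axis is always processed.
        t_near = max(t_near, min(t0, t1))
        t_far = min(t_far, max(t0, t1))
    return t_near <= t_far


def inverse_direction(direction):
    """Computed once per ray and reused for every box test."""
    return tuple(safe_inverse(d) for d in direction)


diagonal = inverse_direction((1.0, 1.0, 1.0))
along_x = inverse_direction((1.0, 0.0, 0.0))

hit_diagonal = ray_hits_box((0.0, 0.0, 0.0), diagonal, (1.0, 1.0, 1.0), (2.0, 2.0, 2.0))
miss_along_x = ray_hits_box((0.0, 0.0, 0.0), along_x, (1.0, 1.0, 1.0), (2.0, 2.0, 2.0))
```

In Python the min/max calls are of course not literally branch-free; in compiled code they usually map to single instructions, which is the point of writing the test this way.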
Sometimes the best approach is counter-intuitive. For example, I found that many scenes with a few large objects ran much faster when I broke them down into a large number of smaller objects. Depending on the scene geometry, this can allow your spatial subdivision algorithm to throw out a lot of intersection tests. And let's face it, intersection tests can be made only so fast. You have to eliminate them to get a significant speed-up.
Hierarchical bounding volumes help a lot, but I finally grokked the kd-tree, and got a HUGE increase in speed. Of course, building the tree has a cost that may make it prohibitive for real-time animation.
Watch for synchronization bottlenecks.
You've got to profile to be sure to focus your attention in the right place.

Is there anything else I can do to speed it up?
D, depending on the implementation and compiler, puts forth reasonably good performance. As you haven't explained what ray tracing methods and optimizations you're using already, then I can't give you much help there.
The next step, then, is to run a timing analysis on the program, and recode in assembly the most frequently used code or the slowest code that impacts performance the most.
More generally, check out the resources in these questions:
Literature and Tutorials for Writing a Ray Tracer
Anyone know of a really good book about Ray Tracing?
Computer Graphics: Raytracing and Programming 3D Renders
raytracing with CUDA
I really like the idea of using a graphics card (a massively parallel computer) to do some of the work.
There are many other raytracing related resources on this site, some of which are listed in the sidebar of this question, most of which can be found in the raytracing tag.

I don't know D at all, so I'm not able to look at the code and find specific optimizations, but I can speak generally.
It really depends on your requirements. One of the simplest optimizations is just to reduce the number of reflections/refractions that any particular ray can follow, but then you start to lose out on the "perfect result".
Raytracing is also an "embarrassingly parallel" problem, so if you have the resources (such as a multi-core processor), you could look into calculating multiple pixels in parallel.
Beyond that, you'll probably just have to profile and figure out what exactly is taking so long, then try to optimize that. Is it the intersection detection? Then work on optimizing the code for that, and so on.

Some suggestions.
- Use bounding objects to fail fast.
- Project the scene at a first step (as common graphic cards do) and use raytracing only for light calculations.
- Parallelize the code.

Raytrace every other pixel. Get the color in between by interpolation. If the colors vary greatly (you are on an edge of an object), raytrace the pixel in between. It is cheating, but on simple scenes it can almost double the performance while you sacrifice some image quality.
Render the scene on GPU, then load it back. This will give you the first ray/scene hit at GPU speeds. If you do not have many reflective surfaces in the scene, this would reduce most of your work to plain old rendering. Rendering CSG on GPU is unfortunately not completely straightforward.
Read source code of PovRay for inspiration. :)
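The "raytrace every other pixel" suggestion can be sketched on a single scanline as follows. The trace callback stands in for the real per-pixel raytrace and returns a grey value here; the function name, the threshold and the 1D simplification are assumptions made for the example.

```python
def render_scanline(trace, width, threshold):
    """Trace even pixels, interpolate odd ones unless their neighbours differ a lot."""
    colors = [None] * width
    traced = 0
    # First pass: trace every other pixel.
    for x in range(0, width, 2):
        colors[x] = trace(x)
        traced += 1
    # Second pass: fill the gaps, tracing only where an edge is suspected.
    for x in range(1, width, 2):
        if x + 1 < width:
            left, right = colors[x - 1], colors[x + 1]
            if abs(left - right) <= threshold:
                colors[x] = (left + right) / 2
                continue
        colors[x] = trace(x)
        traced += 1
    return colors, traced


def step_edge(x):
    """Hypothetical scene: black on the left, white from pixel 5 onwards."""
    return 0.0 if x < 5 else 1.0


colors, traced = render_scanline(step_edge, width=9, threshold=0.1)
```

For this scanline only 6 of the 9 pixels are actually traced. Thin features that fall entirely between two traced pixels will be missed, which is the image quality you sacrifice.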

First you have to make sure that you use very fast algorithms (implementing them can be a real pain; what you want to do, how far you want to go and how fast it should be is a kind of trade-off).
Some more hints from me:
- Don't use mailboxing techniques; papers sometimes discuss that they don't scale that well on current architectures because of the counting overhead.
- Don't use BSP trees or octrees; they are relatively slow.
- Don't use the GPU for raytracing; it is far too slow for advanced effects like reflection, shadows, refraction, photon mapping and so on (I use it only for shading, but that's my own business).
For a completely static scene kd-trees are unbeatable, and for dynamic scenes there are clever algorithms out there that scale very well on a quad-core (I am not sure about the performance beyond that).
And of course, for really good performance you need to use a lot of SSE code (with, of course, not too many jumps), but for not "that good" performance (I'm talking about 10-15% here, maybe) compiler intrinsics are enough to implement your SSE stuff.
And some decent papers about some of the algorithms I was talking about:
"Fast Ray/Axis-Aligned Bounding Box - Overlap Tests using Ray Slopes"
(a very fast, very well parallelizable (SSE) AABB-ray hit test; note that the code in the paper is not the complete code, just google the title of the paper and you'll find it)
http://graphics.tu-bs.de/publications/Eisemann07RS.pdf
"Ray Tracing Deformable Scenes using Dynamic Bounding Volume Hierarchies"
http://www.sci.utah.edu/~wald/Publications/2007///BVH/download//togbvh.pdf
If you know how the above algorithm works, then this is an even better algorithm:
"The Use of Precomputed Triangle Clusters for Accelerated Ray Tracing in Dynamic Scenes"
http://garanzha.com/Documents/UPTC-ART-DS-8-600dpi.pdf
I'm also using the Plücker test to determine quickly (not that accurately, but well, you can't have everything) whether I hit a polygon; it works very nicely with SSE and above.
So my conclusion is that there are so many great papers out there about so many topics related to raytracing (how to build fast, efficient trees, how to shade (BRDF models) and so on), that it is a really amazing and interesting field for experimenting, but you also need a lot of spare time, because it is so damn complicated, but fun.

My first question is: are you trying to optimize the tracing of one single still image, or is this about optimizing the tracing of multiple frames in order to calculate an animation?
Optimizing for a single shot is one thing; if you want to calculate successive frames in an animation, there are lots of new things to think about and optimize.

You could
- use an SAH-optimized bounding volume hierarchy...
- ...eventually using packet traversal,
- introduce importance sampling,
- access the tiles ordered by Morton code for better cache coherency, and
- much more, but those were the suggestions I could immediately think of.
In more words:
You can build an optimized hierarchy based on statistics in order to quickly identify candidate nodes when intersecting geometry. In your case you'll have to combine the automatic hierarchy with the modeling hierarchy, that is either constrain the build or have it eventually clone modeling information.
"Packet traversal" means you use SIMD instructions to compute 4 scalars in parallel, one for each of 4 rays, while traversing the hierarchy (which is typically the hot spot) in order to squeeze the most performance out of the hardware.
You can perform some per-ray-statistics in order to control the sampling rate (number of secondary rays shot) based on the contribution to the resulting pixel color.
Using a space-filling curve (such as the Morton order) over the tile decreases the average spatial distance between consecutively processed pixels and thus increases the probability that your performance benefits from cache hits.
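A minimal sketch of the Morton ordering, assuming 2D pixel coordinates that fit in 16 bits each: the bits of x and y are interleaved into one key, and sorting by that key yields the Z-shaped traversal. The function names are my own.

```python
def part1by1(n):
    """Spread the lower 16 bits of n so that a zero bit sits between each pair."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton2d(x, y):
    """Interleave the bits of x (even positions) and y (odd positions)."""
    return part1by1(x) | (part1by1(y) << 1)


def morton_order(width, height):
    """All (x, y) coordinates of a tile, sorted along the Z-order curve."""
    coords = [(x, y) for y in range(height) for x in range(width)]
    return sorted(coords, key=lambda c: morton2d(c[0], c[1]))


order = morton_order(4, 4)
```

For the 32x32 tiles from the question, the same ordering can be precomputed once per tile size and reused for every tile.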
