We have a WebGL/three.js application that makes extensive use of texture buffers for passing data between passes and for storing arrays of data. None of these has any use for mipmaps. We are easily able to prevent mipmap generation: at the three.js level we set the min and mag filters to NearestFilter, and set generateMipmaps to false.
However, the shaders do not know at compile time that there is no mipmapping. When compiled using ANGLE we get a lot of warning messages:
warning X4121: gradient-based operations must be moved out of flow control to prevent divergence. Performance may improve by using a non-gradient operation
I have recoded the shaders so that flow control around such lookups is (optionally) avoided.
On my Windows/NVidia machine, using the conditional flows improves performance and does not cause any visual issues (but does produce the messages).
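For concreteness, the flow-avoided variant looks roughly like the following (the uniform and varying names are illustrative, and the shader source is shown as a C++ raw string only for presentation): the texture2D call is hoisted out of the branch so its implicit derivative is computed in uniform control flow, and the result is selected afterwards.

    const char* fragmentSrc = R"(
        precision mediump float;
        uniform sampler2D u_tex;
        uniform float u_useLookup;   // 0.0 or 1.0, replaces the old branch
        varying vec2 v_uv;
        void main() {
            vec4 fetched = texture2D(u_tex, v_uv);  // unconditional fetch
            gl_FragColor = mix(vec4(0.0), fetched, u_useLookup);
        }
    )";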
I don't want the texture lookups to be gradient-based operations. What I would like is to write the shaders in such a way that they know at compile time that there is no decision to be made, which should (marginally) improve performance and also make the messages go away. However, I cannot see any way to do this in GLSL for GLES 2 (as used by WebGL). It can be done in later versions with textureLodOffset() and various other mechanisms. The only control at level 2 that I can see is the bias parameter of texture2D(), but that is a bias, not an absolute value, and so does not resolve the issue. So, finally ...
Question: Do you know any way to prevent lod calculation in WEBGL level GLSL shaders?
You might try ensuring that:
you use gl_FragCoord instead of a user varying;
NEAREST is set before texImage2D, not after (see the sketch below).
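A sketch of that ordering at the raw GL level, shown with the C API for concreteness (width, height and pixels are assumed to exist; three.js issues equivalent calls internally):

    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    // Filtering state is set BEFORE the upload, and glGenerateMipmap is
    // never called, so level 0 is the only level the sampler can touch.
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);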
I have a 48-bit RGB16F texture.
The glTexImage2D reference page (https://www.khronos.org/registry/OpenGL-Refpages/es3.0/html/glTexImage2D.xhtml) states that when using RGB, 1.0 will be put into the alpha channel.
Is 1.0 implicit or actually stored?
And in the latter case, my main question: if I put my 16-bit heightmap into the alpha channel, so the texture becomes RGBA16F, will I improve performance?
All insights are welcome.
Is 1.0 implicit or actually stored?
That's implementation-specific. If you were asking about 888 vs 8888 textures, I'd tell you that pretty much every implementation is bound to use 32 bits per texel, but I'm not so sure for 16F formats. It is telling that Metal doesn't define an RGB16F format (link), which strongly suggests that PowerVR GPUs, at least, will pad the format. Vulkan does define RGB16F, but while the spec requires support for R16F, RG16F and RGBA16F, it doesn't require support for RGB16F (link), again suggesting a lack of native support by some vendors. I wouldn't be surprised if some GPU somewhere does support RGB16F, but I suspect most would just pad. For a more definitive answer you might need to post questions on the GPU vendors' forums, or experiment by examining memory usage under controlled conditions.
And in the latter case, my main question: if I put my 16-bit heightmap into the alpha channel, so it becomes RGBA16F, will I improve performance?
Are you sampling them at the same time (i.e. from the same shader, with the same UVs)? If so, then yes, absolutely, it will be a better choice than using an RGB16F plus an R16F. If they're not sampled together (e.g. the heightmap is sampled in the vertex shader, the colour in the fragment shader), then it's harder to guess. You'd probably be harming performance on the heightmap fetch (those extra bytes blowing the cache) while leaving the colour fetch unharmed (there was padding there anyway); overall you'd lose some performance but save some memory. Any performance loss is probably pretty minor, and if your bottleneck lies elsewhere it may do no harm at all.
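A hedged sketch of the packed upload, assuming ES 3.0-style sized internal formats (width, height and an interleaved half-float pixels buffer are assumed): colour lives in .rgb and the 16-bit height rides in .a, so a single RGBA16F texture serves both.

    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F, width, height, 0,
                 GL_RGBA, GL_HALF_FLOAT, pixels);

In the shader, one fetch then returns both: the .rgb of the sampled texel is the colour and the .a is the height.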
Is 1.0 implicit or actually stored?
I suspect "both", although perhaps not in the way you mean.
Most GPU samplers support implicit rules for missing channels (0.0 for colour, 1.0 for alpha), and using these costs less power than sampling and filtering from memory, so I would expect the hardware to use implicit loads for the missing channels.
However, hardware is also usually allergic to loading things that are not a power of two in size (anything that spans a cache-line boundary typically takes two cycles to load on most cache architectures), so I would also expect each texel to be padded out to 64 bits. What the 16 bits of padding contain may not be 1.0; the hardware doesn't care, because it's using the implicit rules.
I'm developing with OpenGL ES 2 and GLSL, and I'm stuck on how to approach multi-coloured / fractioned gradients (linear and radial).
I don't know which approach is the best practice:
Get a texture of the gradient colours & sample this in the fragment shader (essentially working with a regular texture).
Computer-generate a texture of the gradient first & sample this in the fragment shader as above (no need for PNGs etc. of the gradient), caching this texture to save regeneration (a sketch of this option follows the list).
Use interpolation in the fragment shader to calculate the fragment value from the fragment position; this looks like I'd need multiple ifs, a loop: stuff you don't want executed per fragment.
Some other strategy I haven't conceived of.
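For the second option, a minimal sketch (the 256-texel resolution, the Color type and evaluateGradient() are illustrative assumptions): bake the gradient stops into a small 1-D lookup texture once, cache the GL texture object, and sample it like any other texture.

    #include <vector>

    std::vector<unsigned char> ramp(256 * 4);
    for (int i = 0; i < 256; ++i) {
        float t = i / 255.0f;
        Color c = evaluateGradient(t);  // hypothetical CPU-side stop interpolator
        ramp[i * 4 + 0] = c.r;  ramp[i * 4 + 1] = c.g;
        ramp[i * 4 + 2] = c.b;  ramp[i * 4 + 3] = c.a;
    }
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 256, 1, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, ramp.data());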
I know this question is a bit on the subjective side, but having looked around online for information I've not found anything concrete about how to proceed...
Well, I can tell you how to proceed, but you may not like the answer. ;) The main two approaches are sampling a texture, or doing shader calculations. To decide which one is more efficient in your case, you need to implement both and start benchmarking. There are way too many factors influencing the performance of each to give a generic answer.
One of the major factors is of course how complex your calculations are. But modern GPUs have very high raw performance for pure calculations. Not quite as much for the mobile GPUs you're most likely using since you're asking about ES, but even the latest mobile GPUs have become quite powerful. Branches aren't free, but not necessarily as harmful as you might expect.
On the other hand, texture sampling looks like a single operation in the shader, but based on that alone you should not assume that it's automatically faster than executing a bunch of computations. Texture sampling performance can be limited by many factors, including throughput of the texture sampling hardware units, memory bandwidth, cache hit rates, etc. Particularly if your textures need to be fairly large to give you the necessary precision, memory bandwidth can hurt you, and accessing memory on a mobile device consumes significant power. Also, just the additional memory usage is undesirable since you mostly deal with very constrained amounts of memory.
Of course the performance characteristics can vary greatly between different GPUs. So if you want to make reliable conclusions, you need to benchmark on a variety of devices.
For the approach where you implement the computations in the shader, make sure that it is as optimal as it can be. Avoid branches where reasonably possible, or at least benchmark various options to see how much the branches hurt performance. If there are parts of the computation that are the same for each fragment, pre-compute the values and pass them into the shader. Replace expensive operations by cheaper ones where possible. For example, instead of dividing by a uniform value, pass in the inverse as a uniform, and use a multiplication instead. Use vector operations where possible.
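As a hedged illustration of that last point (uniform names are mine, not from the question): a two-stop linear gradient where the division by the squared gradient length is folded into a uniform precomputed on the CPU, shown as a C++ raw string for presentation.

    const char* gradientFrag = R"(
        precision mediump float;
        uniform vec4 u_color0;        // start colour
        uniform vec4 u_color1;        // end colour
        uniform vec2 u_start;         // gradient start point
        uniform vec2 u_dirOverLenSq;  // (end - start) / |end - start|^2, precomputed
        varying vec2 v_pos;
        void main() {
            // t = dot(p - start, end - start) / |end - start|^2, divide prefolded
            float t = clamp(dot(v_pos - u_start, u_dirOverLenSq), 0.0, 1.0);
            gl_FragColor = mix(u_color0, u_color1, t);
        }
    )";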
My question is about design and possible suggestions for the following scenario:
I am writing a 3d visualizer. For my renderable objects I would like to store the minimum data possible (so quaternions are naturally nice for rotation).
At some point I must extract a matrix for rendering, which requires computation and temporary storage on every frame update (even for objects that do not change spatially).
Given that many objects remain static and don't need to be rotated locally would it make sense to store the matrix instead and thereby avoid the computation for each object each frame? Is there any best practice approach to this perhaps from a game engine design point of view?
I am currently a bit torn between the two extremes of storing either position + quaternion or a 4x3/4x4 matrix. Looking at openFrameworks (which is not necessarily trying to achieve the same goal as me), they seem to use a hybrid where they store a quaternion AND a matrix (the matrix always reflects the quaternion), so it's always ready when needed but needs to be updated along with every change to the quaternion.
More compact storage requires 3 scalars, so Euler angles or exponential maps (Rodrigues) can be used. Quaternions are a good compromise between compactness and speed of conversion to a matrix.
From a design point of view, there is a good rule: "make all design decisions as LATE as possible". In your case, just encapsulate (isolate) the rotation (transformation) representation, so that in the future you can change the physical storage of the data in its different states (file, memory, rendering and more). It also enables platform-specific optimisations, such as keeping the data on the GPU or the CPU.
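A minimal sketch of that encapsulation, combined with the lazy hybrid the question mentions (Quat, Vec3, Mat4 and Mat4::fromQuatTrans are stand-ins for whatever math library you use): the quaternion stays the source of truth and the matrix is rebuilt only when the rotation actually changes.

    struct Transform {
        void setRotation(const Quat& q) { rotation = q; dirty = true; }
        void setPosition(const Vec3& p) { position = p; dirty = true; }

        const Mat4& matrix() const {
            if (dirty) {                 // at most one rebuild per change
                cached = Mat4::fromQuatTrans(rotation, position);
                dirty = false;
            }
            return cached;               // static objects pay nothing per frame
        }

    private:
        Quat rotation;
        Vec3 position;
        mutable Mat4 cached;
        mutable bool dirty = true;
    };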
Been there.
First: keep in mind the omnipresent struggle of time against space (in computer science: processing time against memory requirements).
You said you want to keep the minimum information possible (space), and then talked about a temporary matrix reflecting the quaternions, which is more of a time worry.
If you accept a tip, I would go for the matrices. They are the performance-wise standard for 3D graphics, and their size quickly becomes irrelevant next to the object data itself.
Just to give an idea: on most GPUs, transforming a vector by the identity (no change) is actually faster than checking whether it needs transformation and then doing nothing.
As for engines, I can't think of one that does not apply the transformations to every vertex every frame. Even if the objects stay in place, their positions have to go through the projection and view matrices.
(Does this answer your question? Maybe I got you wrong.)
Which is faster, a single call to glUseProgram, or sending e.g. 6 or so floats via glUniform (batched or separately), and by approximately how much?
Can you describe in more detail the scenario where you think this affects the performance of the rendering pipeline? They offer completely different functionalities and I don't see why you would care about the performance of glUseProgram vs glUniform.
Now let's analyze what happens when you use these functions, to get an idea of their cost.
When you call glUseProgram, it changes several OpenGL rendering states because we are going to use the new shaders attached to the program object. The specification says that the vertex and fragment programs are installed in the processors when you invoke this function. That alone seems costly enough to overshadow the cost of glUniform. Also, when you install new vertex and fragment programs, additional states of the rendering pipeline are changed to accommodate the number of texture units and the data layout used by the programs.
glUniform copies data from one location of memory to another to specify the value of a uniform variable. The worst case would be copying matrices, which still seems less complex than glUseProgram.
But in the end, it all depends on the amount of data you are transferring with glUniform, on the underlying implementation of glUseProgram (it could be heavily optimized by the driver and have a very small cost), and on whether your engine is smart enough to group the geometry that uses the same program and draw it without changing states.
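A hedged sketch of that grouping idea (DrawItem and its fields are hypothetical): sort the draw calls by program so glUseProgram runs once per group while only the cheaper glUniform* calls run per object.

    #include <algorithm>
    #include <vector>

    struct DrawItem {
        GLuint         program;
        GLint          mvpLocation;
        const GLfloat* mvp;
        GLsizei        vertexCount;
    };

    void drawSorted(std::vector<DrawItem>& items) {
        std::sort(items.begin(), items.end(),
                  [](const DrawItem& a, const DrawItem& b) { return a.program < b.program; });
        GLuint current = 0;
        for (const DrawItem& item : items) {
            if (item.program != current) {   // program switch only at group boundaries
                glUseProgram(item.program);
                current = item.program;
            }
            glUniformMatrix4fv(item.mvpLocation, 1, GL_FALSE, item.mvp);
            glDrawArrays(GL_TRIANGLES, 0, item.vertexCount);
        }
    }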
The answer would seem to be no, because raymarching is highly conditional, i.e. each ray follows a unique execution path: on each step we check for opacity, termination, etc., and the results vary with the direction of the individual ray.
So it would seem that SIMD would largely not be able to accelerate this; rather, MIMD would be required for acceleration.
Does this make sense? Or am I missing something(s)?
As stated already, you could probably get a speedup from implementing your vector math using SSE instructions (be aware of the effects discussed here - also for the other approach). This approach would allow the code to stay concise and maintainable.
I assume, however, that your question is about "packet traversal" (or something like it), in other words processing multiple scalar values, each belonging to a different ray:
In principle it should be possible to defer the shading to another pass. The SIMD packet could be repopulated with a new ray once the bare marching pass terminates, with the temporary result stored as input for the shading pass. This would allow you to parallelize a certain, case-dependent percentage of your code, exploiting all four SIMD lanes.
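A hedged sketch of that repopulation idea, with the SIMD stepping reduced to a scalar stub for brevity (Ray, Hit, marchStep() and the four-lane layout are all illustrative, not from the original post):

    #include <vector>

    struct Ray { /* origin, direction, ... */ };
    struct Hit { int ray; float t; };

    // Stub standing in for one marching step; reports termination.
    static bool marchStep(const Ray&, float& t) { t += 0.1f; return t >= 1.0f; }

    void marchAll(const std::vector<Ray>& rays, std::vector<Hit>& hits) {
        int lane[4]; float t[4];
        int next = 0, live = 0;
        const int total = (int)rays.size();
        while (live < 4 && next < total) { lane[live] = next++; t[live++] = 0.0f; }
        while (live > 0) {
            for (int i = 0; i < live; ++i) {
                if (marchStep(rays[lane[i]], t[i])) {   // this lane finished
                    hits.push_back({lane[i], t[i]});    // defer shading to pass 2
                    if (next < total) { lane[i] = next++; t[i] = 0.0f; }  // refill
                    else { --live; lane[i] = lane[live]; t[i] = t[live]; --i; }
                }
            }
        }
    }

In the real version the inner loop becomes one set of SSE operations across all four lanes, and the shading pass then consumes hits coherently.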
Tiling the image and indexing the rays within each tile in Morton order might be a good idea too, in order to reduce cache pressure (unless your geometry is strictly procedural).
You won't know whether it pays off unless you try. My guess is that if it does, the amount of speedup might not be worth the complication of the code for just four lanes.
Have you considered using an SIMT architecture such as a programmable GPU? A somewhat up-to-date programmable graphics board allows you to perform raymarching at interactive rates (see it happen in your browser here).
Over the last few days I have built a software-based raymarcher for a Menger sponge, for the moment without SIMD and without any special algorithm. I just trace from -1 to 1 in X and Y, which are U and V for the destination texture. From a camera position and a destination I then calculate the increment vector for the raymarch.
After that I iterate a constant number of times, and inside the loop a single branch decides whether there is an intersection with the fractal volume. So if my camera eye is E and my direction vector is D, I have to find the smallest t for which E + tD hits the volume. Once I have found it, or reached a maximum distance, I break out of the loop. At the end I have t, and from that I calculate the fragment color.
In my opinion it should be possible to parallelize these operations with SSE1/2, because the branch can be resolved by null'ing the corresponding field in the vector (__m64 / __m128) so that further SIMD operations have no effect on that lane. It really depends on what you raymarch/-cast, but if you just calculate a fragment color from a function (like my fractal curve here) and don't access memory non-linearly, there are some tricks to make it possible.
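For the lane-nulling trick, a hedged SSE sketch (sceneDistance() is a hypothetical four-wide distance evaluator; MAX_STEPS and EPSILON are illustrative constants): finished lanes are masked out instead of branched over, so the same instructions keep running for the whole packet.

    #include <emmintrin.h>  // SSE / SSE2 intrinsics

    __m128 sceneDistance(__m128 t);  // hypothetical: 4 distances at once

    const int   MAX_STEPS = 128;
    const float EPSILON   = 1e-3f;

    __m128 marchPacket() {
        __m128 t      = _mm_setzero_ps();                      // distance per ray
        __m128 active = _mm_castsi128_ps(_mm_set1_epi32(-1));  // all lanes live
        for (int i = 0; i < MAX_STEPS; ++i) {
            __m128 d   = sceneDistance(t);
            __m128 hit = _mm_cmplt_ps(d, _mm_set1_ps(EPSILON)); // which lanes hit
            active     = _mm_andnot_ps(hit, active);            // null finished lanes
            t          = _mm_add_ps(t, _mm_and_ps(d, active));  // advance live lanes only
            if (_mm_movemask_ps(active) == 0) break;            // every ray terminated
        }
        return t;
    }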
Sure, this answer contains speculation, but I will keep you informed when I've parallelized this routine.
Only insofar as SSE, for instance, lets you do operations on vectors in parallel.