GLSL performance differences between function call vs. inlining

Does it make a performance difference in GLSL if something as simple as a + operator is wrapped in a function?
Take, for example, these two scenarios:
Example 1:
uniform float uValueA;
uniform float uValueB;

void main()
{
    float value = uValueA + uValueB;
    // [...]
}
Example 2:
uniform float uValueA;
uniform float uValueB;

float addValues(float a, float b)
{
    return a + b;
}

void main()
{
    float value = addValues(uValueA, uValueB);
    // [...]
}
Is there any difference in the compiled end product, or do both result in the same number of instructions and the same performance?

When I tested this specific case a couple of years ago, I found no performance difference between using functions and writing the code inline. If I remember correctly, at the time I used tools from Nvidia and/or AMD to look at the assembly generated from the GLSL files, and it was identical whether I used functions or not. This suggests that the compiler inlines functions.
I suggest you look at the assembly for both versions of your shader to convince yourself. This question (https://gamedev.stackexchange.com/questions/65695/aquire-disassembly-of-shader-code) explains some ways to get that information.
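For example, on desktop OpenGL 4.1+ (or with ARB_get_program_binary) you can dump the linked program's binary blob straight from your application; on some drivers (Nvidia in particular) that blob contains human-readable assembly. A minimal C++ sketch, assuming a GL loader is already set up and prog is a linked program object:

#include <cstdio>
#include <vector>
// Assumes a GL loader header (GLEW/GLAD/...) is included and a 4.1+ context is current.
// Optionally set GL_PROGRAM_BINARY_RETRIEVABLE_HINT via glProgramParameteri before linking.

void dumpProgramBinary(GLuint prog, const char* path)
{
    GLint length = 0;
    glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &length);

    std::vector<unsigned char> blob(length);
    GLenum binaryFormat = 0;
    glGetProgramBinary(prog, length, nullptr, &binaryFormat, blob.data());

    // Write the raw blob to disk; inspect it in a text/hex editor.
    std::FILE* f = std::fopen(path, "wb");
    std::fwrite(blob.data(), 1, blob.size(), f);
    std::fclose(f);
}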

You can essentially assume nothing about how your shader is optimized, because compilation is vendor specific. It would make sense for a compiler to optimize this very simple case and inline the function, making the two equivalent, but that is in no way guaranteed. In theory a compiler could insert a million no-ops for every function call (although the person who wrote that compiler might be fired :)).
That said, you can "pre-optimize" your GLSL code so that these sorts of optimizations are performed before the code is sent to the driver's compiler (generally done offline). The glsl-optimizer is frequently used for this purpose and is built into the Unity engine.
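For reference, a rough sketch of how glsl-optimizer is typically driven from C++; the function and enum names below follow the project's README/glsl_optimizer.h as I remember them, so double-check them against the copy you build:

#include <cstdio>
#include <string>
#include "glsl_optimizer.h"  // from the glsl-optimizer project

std::string optimizeFragmentShader(const char* source)
{
    glslopt_ctx* ctx = glslopt_initialize(kGlslTargetOpenGL);
    glslopt_shader* shader = glslopt_optimize(ctx, kGlslOptShaderFragment, source, 0);

    std::string result;
    if (glslopt_get_status(shader))
        result = glslopt_get_output(shader);  // optimized GLSL source, ready for glShaderSource
    else
        std::fprintf(stderr, "glsl-optimizer: %s\n", glslopt_get_log(shader));

    glslopt_shader_delete(shader);
    glslopt_cleanup(ctx);
    return result;
}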

Related

GLSL: scalar vs vector performance

All modern GPUs have scalar architectures, but shading languages offer a variety of vector and matrix types. I would like to know how scalarizing or vectorizing GLSL source code affects performance. For example, let's define some "scalar" points:
float p0x, p0y, p1x, p1y, p2x, p2y, p3x, p3y, p4x, p4y;
p0x = 0.0f; p0y = 0.0f;
p1x = 0.0f; p1y = 0.61f;
p2x = 0.9f; p2y = 0.4f;
p3x = 1.0f; p3y = 1.0f;
and their vector equivalents:
vec2 p0 = vec2(p0x, p0y);
vec2 p1 = vec2(p1x, p1y);
vec2 p2 = vec2(p2x, p2y);
vec2 p3 = vec2(p3x, p3y);
Having these points, which of the following mathematically equivalent pieces of code will run faster?
Scalar code:
position.x = -p0x*pow(t-1.0,3.0)+p3x*(t*t*t)+p1x*t*pow(t-1.0,2.0)*3.0-p2x*(t*t)*(t-1.0)*3.0;
position.y = -p0y*pow(t-1.0,3.0)+p3y*(t*t*t)+p1y*t*pow(t-1.0,2.0)*3.0-p2y*(t*t)*(t-1.0)*3.0;
or its vector equivalent:
position.xy = -p0*pow(t-1.0,3.0)+p3*(t*t*t)+p1*t*pow(t-1.0,2.0)*3.0-p2*(t*t)*(t-1.0)*3.0;
Or will they run equally fast on modern GPUs?
The code above is only an example. Real-life examples of such "vectorizable" code may perform much heavier computations with many more input variables coming from ins, uniforms, and vertex attributes.
The vectorised version is highly unlikely to be slower - in the worst case, it will probably just be replaced with the scalar version by the compiler anyway.
It may, however, be faster. Whether it is faster largely depends on whether the code branches: if there are no branches, it is easier to feed the work to multiple SIMD lanes than it is with code that branches. Compilers are pretty smart and might be able to figure out that the scalar version can also be sent to multiple SIMD lanes, but the compiler is more likely to be able to do its job to the best of its ability with the vectorised version. Compilers are also smart enough to sometimes keep the SIMD lanes fed in the presence of limited branching, so even with branching code you are probably better off using the vectorised version.
Your best bet is to benchmark on all the varieties of systems (i.e. GPUs) that you believe could run this code, and work out which ones are faster with the vectorised code and which are faster with the scalar code. Then you would write both versions of the code (or, more likely, a multitude of versions), and add runtime logic to switch between them based on which GPU/drivers are in use.
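A hedged sketch of what that runtime switch could look like on the OpenGL side, keying off the (admittedly crude) vendor string; the function and shader file names here are made up for illustration:

#include <string>
// Assumes a GL loader is included and a context is current.

std::string pickBezierShaderPath()
{
    // GL_VENDOR / GL_RENDERER are free-form strings, so this matching is heuristic.
    std::string vendor = reinterpret_cast<const char*>(glGetString(GL_VENDOR));

    // Hypothetical file names: one variant per code path you benchmarked.
    if (vendor.find("NVIDIA") != std::string::npos)
        return "bezier_vectorised.frag";
    if (vendor.find("AMD") != std::string::npos || vendor.find("ATI") != std::string::npos)
        return "bezier_scalar.frag";
    return "bezier_vectorised.frag";  // reasonable default
}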
All of that, of course, is a huge hassle, and most programmers won't do it. GPGPU programmers usually have only a single server/GPU node type to work with, so their code is tailored to a single architecture. Meanwhile, AAA game studios (the only other places with the budget and manpower to tackle that kind of task) usually just let Nvidia and AMD sort out that magic on their end: Nvidia/AMD write better, more optimized versions of the shaders used by those games, add them to their drivers, and have the drivers substitute the better shaders for whatever Gearbox/Bethesda/whoever tried to load.
The important thing is that, for your use case, your best bet is to focus on making the code more maintainable; that will save you far more time, and will make your program run better, than any "premature optimization" will (which, let's be clear, is basically what you're doing).

Why is this Transpose() required in my WorldViewProj matrix?

Given a super-basic vertex shader such as:
output.position = mul(position, _gWorldViewProj);
I was having a great deal of trouble because I was setting _gWorldViewProj as follows; I tried both orderings (a bit of flailing) to make sure it wasn't just backwards:
mWorldViewProj = world * view * proj;
mWorldViewProj = proj * view * world;
My solution turned out to be:
mWorldView = world * view;
mWorldViewProj = XMMatrixTranspose(mWorldView * proj);
Can someone explain why this XMMatrixTranspose was required? I know there were matrix differences between XNA and HLSL (I think) but not between vanilla C++ and HLSL, though I could be wrong.
The problem is I don't know whether I'm wrong or what I'm wrong about! So if someone could tell me precisely why the transpose is required, I hopefully won't make the same mistake again.
On the CPU, 2D arrays are generally stored in row-major ordering, so the order in memory goes x[0][0], x[0][1], ... In HLSL, matrix declarations default to column-major ordering, so the order goes x[0][0], x[1][0], ...
In order to transform the memory from the layout defined on the CPU to the layout expected by HLSL, you need to transpose the CPU matrix before sending it to the GPU. Alternatively, you can use the row_major keyword in HLSL to declare the matrices as row-major, eliminating the need for a transpose but leading to different codegen in HLSL (you'll often end up with mul-adds instead of dot products).
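To make that concrete, here is a minimal C++ sketch using DirectXMath; the constant-buffer struct and function name are placeholders that mirror the question rather than code taken from it:

#include <DirectXMath.h>
using namespace DirectX;

// Mirrors a cbuffer containing: float4x4 _gWorldViewProj;
struct PerObjectCB
{
    XMFLOAT4X4 worldViewProj;
};

void fillPerObjectCB(PerObjectCB& cb, FXMMATRIX world, CXMMATRIX view, CXMMATRIX proj)
{
    // Row-major math on the CPU...
    XMMATRIX wvp = world * view * proj;

    // ...transposed once here so the default column_major HLSL matrix reads it correctly.
    // (Alternatively, declare the HLSL matrix row_major and skip the transpose.)
    XMStoreFloat4x4(&cb.worldViewProj, XMMatrixTranspose(wvp));

    // cb is then copied into the mapped/updated constant buffer as usual.
}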

Why should boolean conditions in if statements be avoided in OpenGL shaders?

I am a newbie in OpenGL, and I have a question I must answer for my lead: "Why should bool expressions like the one used in the example below be avoided in if and if-else conditional statements?" I must answer it tomorrow but I don't have any clue. Can anyone help me?
Thanks!
P.S.: here is the code:
void main()
{
    vec4 color = texture2D(tex, v_uv);
    if (color.r < 0.25)
        gl_FragColor = texture2D(tex1, v_uv);
    else
        gl_FragColor = texture2D(tex2, v_uv);
}
You didn't provide the example, but I'm going to make some assumptions and say that branching on the GPU can be a bad thing.
Different GPUs have support for different styles of branching though, so the impact depends on the code and your target's support (SIMD, MIMD, condition code branching, etc).
Depending on the type of branching (i.e. what conditions you are checking and what the resulting code is), other cores in the grid may end up waiting until the last one completes its if branch and the resulting code. So, if one core went off and had to do some complicated work because a condition was satisfied, all the other cores will need to wait on it. This can really add up and reduce your performance... but it depends on the target and the code!
Because GPUs don't like dynamic branching.

Library function capabilities of Mathematica

I am trying to use CUSP as an external linear solver for Mathematica to use the power of the GPU.
Here is the CUSP project webpage. I am asking for suggestions on how we can integrate CUSP with Mathematica. I am sure many of you here will be interested in discussing this. I think writing an input matrix to a file and then feeding it to a CUSP program is not the way to go. Using Mathematica's LibraryFunctionLoad would be a better way to pipe the input matrix to the GPU-based solver on the fly. What would be the way to supply the matrix and the right-hand side directly from Mathematica?
Here is a CUSP code snippet:
#include <cusp/hyb_matrix.h>
#include <cusp/io/matrix_market.h>
#include <cusp/krylov/cg.h>

int main(void)
{
    // create an empty sparse matrix structure (HYB format)
    cusp::hyb_matrix<int, float, cusp::device_memory> A;

    // load a matrix stored in MatrixMarket format
    cusp::io::read_matrix_market_file(A, "5pt_10x10.mtx");

    // allocate storage for solution (x) and right-hand side (b)
    cusp::array1d<float, cusp::device_memory> x(A.num_rows, 0);
    cusp::array1d<float, cusp::device_memory> b(A.num_rows, 1);

    // solve the linear system A * x = b with the Conjugate Gradient method
    cusp::krylov::cg(A, x, b);

    return 0;
}
This question gives us an opportunity to discuss the compilation capabilities of Mathematica 8. It is also possible to bring up Mathematica's MathLink interface. I hope people here find this problem worthwhile and interesting enough to ponder.
If you want to use LibraryLink (for which LibraryFunctionLoad is used to access a dynamic library function as a Mathematica downvalue), there's actually not much room for discussion: LibraryFunctions can receive Mathematica tensors of machine doubles or machine integers, and you're done.
The Mathematica MTensor format is a dense array, just as you'd naturally use in C, so if CUSP uses some other format you will need to write some glue code to translate between representations.
Refer to the LibraryLink tutorial for full details.
You will especially want to note the section "Memory Management of MTensors" on the Interaction with Mathematica page, and choose the "Shared" mode to just pass a Mathematica tensor by reference.
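To illustrate the shape of that glue code, here is a hedged LibraryLink sketch; the function name solveCG, the library name in the comment, and the CUSP conversion step are placeholders, while the MTensor accessors are the standard ones from WolframLibrary.h:

#include "WolframLibrary.h"

EXTERN_C DLLEXPORT mint WolframLibrary_getVersion() { return WolframLibraryVersion; }
EXTERN_C DLLEXPORT int WolframLibrary_initialize(WolframLibraryData libData) { return LIBRARY_NO_ERROR; }

// Loaded from Mathematica roughly like this (names illustrative):
//   solveCG = LibraryFunctionLoad["cuspLink", "solveCG",
//       {{Real, 2, "Shared"}, {Real, 1, "Shared"}, {Real, 1, "Shared"}}, Integer]
EXTERN_C DLLEXPORT int solveCG(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res)
{
    MTensor tA = MArgument_getMTensor(Args[0]);  // dense n x n matrix
    MTensor tb = MArgument_getMTensor(Args[1]);  // right-hand side, length n
    MTensor tx = MArgument_getMTensor(Args[2]);  // solution, written back in place ("Shared")

    mreal *A = libData->MTensor_getRealData(tA);
    mreal *b = libData->MTensor_getRealData(tb);
    mreal *x = libData->MTensor_getRealData(tx);
    mint const *dims = libData->MTensor_getDimensions(tA);
    mint n = dims[0];

    // Glue goes here: copy A into a cusp::hyb_matrix (or csr_matrix) in device_memory,
    // wrap b and x as cusp::array1d, run cusp::krylov::cg(A, x, b), and copy x back.

    MArgument_setInteger(Res, 0);
    return LIBRARY_NO_ERROR;
}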

Can operations on a struct's fields, referenced by pointer, be auto-vectorized?

This is my code.
struct Vector
{
    float x, y, z, w;
};
typedef struct Vector Vector;

inline void inv(Vector* target)
{
    (*target).x = -(*target).x;
    (*target).y = -(*target).y;
    (*target).z = -(*target).z;
    (*target).w = -(*target).w;
}
I'm using GCC for ARM (iPhone). Can this be vectorized?
PS: I'm trying some kind of optimization. Any recommendations are welcome.
Likely not; however, you can try using a restrict-qualified pointer, which reduces aliasing concerns in the compiler and can potentially produce better code.
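For instance, a sketch of that suggestion (__restrict is the GCC/Clang spelling when compiling as C++; plain restrict works in C99):

// Promises the compiler that nothing else aliases *target inside inv().
inline void inv(Vector* __restrict target)
{
    target->x = -target->x;
    target->y = -target->y;
    target->z = -target->z;
    target->w = -target->w;
}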
It depends on how Vector is defined, but it may be possible. If you're looking for auto-vectorization, then try Intel's ICC (assuming we're talking about x86 here?), which does a pretty good job in certain cases (much better than gcc), although it can always be improved upon by explicit vectorization by hand, of course, since the programmer knows more about the program than the compiler can ever infer from the source code alone.
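Since the question actually targets ARM (iPhone) rather than x86, another option is to vectorize by hand with NEON intrinsics. A minimal sketch, assuming a NEON-capable target and that Vector stays four contiguous floats:

#include <arm_neon.h>

// Hand-vectorized negation of all four components at once.
inline void inv_neon(Vector* target)
{
    float32x4_t v = vld1q_f32(&target->x);  // load x, y, z, w
    vst1q_f32(&target->x, vnegq_f32(v));    // negate and store back
}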
