I am a newbie in OpenGL. I have a question that I must answer for my team leader: "Why should bool expressions like the one used in the example above be avoided in if and if-else conditional statements?" I must answer it tomorrow but I don't have a clue. Can anyone help me?
Thanks!
P.S. This is the code:
void main()
{
    vec4 color = texture2D(tex, v_uv);
    if (color.r < 0.25)
        gl_FragColor = texture2D(tex1, v_uv);
    else
        gl_FragColor = texture2D(tex2, v_uv);
}
You didn't provide the example, but I'm going to make an assumption and say that branching on a GPU can be a bad thing.
Different GPUs have support for different styles of branching though, so the impact depends on the code and your target's support (SIMD, MIMD, condition code branching, etc).
Depending on the type of branching (i.e. what conditions you are checking and what the resulting code is), other cores in the grid may end up waiting until the last one completes its if branch and the resulting code. So, if one core went off and had to do some complicated work because a condition was satisfied, then all the other cores will need to wait on that core. This can really add up and reduce your performance... But it depends on the target and the code!
Because GPUs don't like dynamic branching.
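One common way to sidestep the branch in the question's shader is to compute both results and select between them arithmetically. The C++ sketch below models that idea on the CPU. The `step` and `mix` helpers mimic the GLSL built-ins of the same name, while `Vec4`, `shadeBranching` and `shadeSelect` are hypothetical names introduced only for illustration. Note that the select form always pays for both texture samples, so whether it is actually faster depends on the target GPU and should be measured.

```cpp
#include <array>
#include <cassert>

// RGBA colour, standing in for GLSL's vec4 (an assumption for this sketch).
using Vec4 = std::array<float, 4>;

// Mimics GLSL step(edge, x): 0.0 if x < edge, otherwise 1.0.
// The ternary is only a CPU-side emulation of the built-in.
float step(float edge, float x) { return x < edge ? 0.0f : 1.0f; }

// Mimics GLSL mix(a, b, t): linear blend a*(1-t) + b*t, per component.
Vec4 mix(const Vec4& a, const Vec4& b, float t) {
    Vec4 out{};
    for (int i = 0; i < 4; ++i) out[i] = a[i] * (1.0f - t) + b[i] * t;
    return out;
}

// Branching form, shaped like the shader in the question.
// r is the red channel of the first sample; s1 and s2 are the two candidate colours.
Vec4 shadeBranching(float r, const Vec4& s1, const Vec4& s2) {
    if (r < 0.25f) return s1;
    return s2;
}

// Select form: both candidates are always available, one is picked by a 0/1 weight.
Vec4 shadeSelect(float r, const Vec4& s1, const Vec4& s2) {
    return mix(s1, s2, step(0.25f, r));
}
```

For finite colour values the two functions return the same result, because the blend weight is exactly 0 or 1.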
I was wondering if anyone might know whether there might be some kind of optimization going on with HLSL InterlockedAdd, specifically when it is used on a single global atomic counter (added value is constant across all threads) by a large number of threads.
Some information I dug up on the web says that atomic adds can create significant contention issues:
https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/
Granted, the article above is written for CUDA (also a little old dating to 2014), whereas I am interested in HLSL InterlockedAdd. To that end, I wrote a dummy HLSL shader for Unity (compiled to d3d11 via FXC, to my knowledge), where I call InterlockedAdd on a single global atomic counter, such that the added value is always the same across all the shaded fragments. The snippet in question (run in http://shader-playground.timjones.io/, compiled via FXC, optimization lvl 3, shading model 5.0):
**HLSL**:
RWStructuredBuffer<int> counter : register(u1);
void PSMain()
{
InterlockedAdd(counter[0], 1);
}
----
**Assembly**:
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u1, 4
atomic_iadd u1, l(0, 0, 0, 0), l(1)
ret
I then slightly modified the code, and instead of always adding some constant value, I now add a value that varies across fragments, so something like this:
**HLSL**:
RWStructuredBuffer<int> counter : register(u1);
void PSMain(float4 pixel_pos : SV_Position)
{
InterlockedAdd(counter[0], int(pixel_pos.x));
}
----
**Assembly**:
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u1, 4
dcl_input_ps_siv linear noperspective v0.x, position
dcl_temps 1
ftoi r0.x, v0.x
atomic_iadd u1, l(0, 0, 0, 0), r0.x
ret
I implemented the equivalents of the aforementioned snippets in Unity, and used them as my fragment shaders for rendering a full-screen quad (granted, there are no output semantics, but that is irrelevant). I profiled the resulting shaders with Nsight Graphics. Suffice it to say that the difference between the two draw calls was massive, with the fragment shader based on the second snippet (InterlockedAdd with a variable value) being considerably slower.
I also made captures with RenderDoc to check the assembly, and they look identical to what is shown above. Nothing in the assembly code suggests such dramatic difference. And yet, the difference is there.
So my question is: is there some kind of optimization taking place when using HLSL InterlockedAdd on a single global atomic counter, such that the added value is a constant? Is it, perhaps, possible that the GPU driver can somehow rearrange the code?
System specs:
NVIDIA Quadro P4000
Windows 10
Unity 2019.4
The pixel shader on the GPU runs pixels in SIMD groups, called wavefronts. If the code currently executing does not change based on which pixel is being rendered, the code only has to be run once for the entire group. If it changes based on the pixel, then each of the pixels will need to run unique code.
In the first version, a 64-pixel wavefront would execute the code as a single SIMD InterlockedAdd<64>(counter[0], 1); or might even optimize it into InterlockedAdd(counter[0], 64);
In the second example it turns into a series of serial, non-SIMD adds and becomes 64 times as expensive.
This is an oversimplification, and there are other tricks the GPU uses to share computing resources. But a good general rule of thumb is to make as much code as possible sharable by every nearby pixel.
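To make the aggregation idea concrete, here is a small CPU-side model in C++. It is only a sketch of the reasoning above, not a description of what the NVIDIA driver or hardware actually does: the 64-lane wave size is taken from the answer, and `WaveStats`, `addPerLane` and `addAggregated` are hypothetical names. It counts how many atomic operations reach the shared counter when every lane adds on its own versus when the wave's contributions are combined first.

```cpp
#include <atomic>
#include <vector>

// Result of simulating one wave: final counter value and number of atomic operations issued.
struct WaveStats {
    int counterValue;
    int atomicOps;
};

// Per-lane form: every lane issues its own atomic add on the shared counter.
WaveStats addPerLane(const std::vector<int>& laneValues) {
    std::atomic<int> counter{0};
    int ops = 0;
    for (int v : laneValues) {
        counter.fetch_add(v);  // one contended operation per lane
        ++ops;
    }
    return {counter.load(), ops};
}

// Aggregated form: lane values are summed inside the wave first,
// then a single atomic add is issued on behalf of the whole wave.
WaveStats addAggregated(const std::vector<int>& laneValues) {
    std::atomic<int> counter{0};
    int waveSum = 0;
    for (int v : laneValues) waveSum += v;
    counter.fetch_add(waveSum);  // one operation for the whole wave
    return {counter.load(), 1};
}
```

With 64 lanes each adding 1, both forms leave the counter at 64, but the per-lane form issues 64 atomic operations and the aggregated form issues one. When the added value is a constant, the wave sum is just the constant times the lane count, which is the cheap case the answer describes.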
Does it make a performance difference in GLSL if something simple like a + operator is wrapped into a function?
So for example these two scenarios:
Example 1:
uniform float uValueA;
uniform float uValueB;
void main()
{
float value = uValueA + uValueB;
// [...]
}
Example 2:
uniform float uValueA;
uniform float uValueB;
float addValues(float a, float b)
{
return a + b;
}
void main()
{
float value = addValues(uValueA, uValueB);
// [...]
}
Is there any difference in the compiled end product? Or do they result in the same number of instructions and the same performance?
When I tested this specific case a couple years ago, I found no performance difference between functions or in-line code. If I remember correctly, at the time I used tools from Nvidia and/or AMD to look at the assembly code generated from the GLSL files. This also confirmed that the assembly was identical whether I used functions or not. This suggests that functions are inlined.
I suggest you have a look for yourself at the assembly code of both versions of your shader to convince yourself. This question (https://gamedev.stackexchange.com/questions/65695/aquire-disassembly-of-shader-code) explains some ways to get this information.
You essentially can assume nothing about the optimization of your shader, because the compilation is vendor specific. It would make sense that a compiler would optimize this very simple case, and inline the function, making the two equivalent, but that is in no way guaranteed. They could in theory insert a million no-ops for every function call (although, the person who wrote their compiler might be fired :)).
That said, you can "pre-optimize" your GLSL code, such that these sorts of optimizations are performed before the code is sent to the compiler (generally done offline). The glsl-optimizer is frequently used for this purpose, and is built into the Unity engine.
All modern GPUs have scalar architecture, but shading languages offer a variety of vector and matrix types. I would like to know, how does scalarization or vectorization of GLSL source code affect performance. For example, let's define some "scalar" points:
float p0x, p0y, p1x, p1y, p2x, p2y, p3x, p3y, p4x, p4y;
p0x = 0.0f; p0y = 0.0f;
p1x = 0.0f; p1y = 0.61f;
p2x = 0.9f; p2y = 0.4f;
p3x = 1.0f; p3y = 1.0f;
and their vector equivalents:
vec2 p0 = vec2(p0x, p0y);
vec2 p1 = vec2(p1x, p1y);
vec2 p2 = vec2(p2x, p2y);
vec2 p3 = vec2(p3x, p3y);
Having these points, which of the following mathematically equivalent pieces of code will run faster?
Scalar code:
position.x = -p0x*pow(t-1.0,3.0)+p3x*(t*t*t)+p1x*t*pow(t-1.0,2.0)*3.0-p2x*(t*t)*(t-1.0)*3.0;
position.y = -p0y*pow(t-1.0,3.0)+p3y*(t*t*t)+p1y*t*pow(t-1.0,2.0)*3.0-p2y*(t*t)*(t-1.0)*3.0;
or its vector equivalent:
position.xy = -p0*pow(t-1.0,3.0)+p3*(t*t*t)+p1*t*pow(t-1.0,2.0)*3.0-p2*(t*t)*(t-1.0)*3.0;
?
Or will they run equivalently fast on modern GPUs?
The above code is only an example. Real-life examples of such "vectorizable" code may perform much heavier computations with much more input variables coming from global ins, uniforms and vertex attributes.
The vectorised version is highly unlikely to be slower - in the worst case, it will probably just be replaced with the scalar version by the compiler anyway.
It may however be faster. Whether it will be faster largely depends on whether the code branches - if there are no branches, it is easier to feed the processing to multiple SIMD lanes than with code which branches. Compilers are pretty smart, and might be able to figure out that the scalar version can also be sent to multiple SIMD lanes ... but the compiler is more likely to be able to do its job to the best of its ability using the vectorised version. They're also smart enough to sometimes keep the SIMD lanes fed in the presence of limited branching, so even with branching code you are probably better off using the vectorised version.
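The two forms in the question can be checked against each other on the CPU. The C++ sketch below evaluates the same cubic Bezier curve once per component and once with a small vector type, using the control points from the question. `Vec2`, `Weights`, `bernstein`, `bezierScalar` and `bezierVector` are hypothetical names for this illustration. It says nothing about GPU speed; it only shows that the vector form is the scalar form with the shared weights factored out. The weights are written as explicit products of (1 - t) rather than pow(t-1.0, 3.0), since GLSL leaves pow undefined for a negative base.

```cpp
#include <cmath>

// Minimal stand-in for GLSL's vec2.
struct Vec2 { float x, y; };
Vec2 operator*(Vec2 v, float s) { return {v.x * s, v.y * s}; }
Vec2 operator+(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }

// Control points from the question.
const Vec2 P0{0.0f, 0.0f}, P1{0.0f, 0.61f}, P2{0.9f, 0.4f}, P3{1.0f, 1.0f};

// The four cubic Bernstein weights; they depend only on t, not on the component.
struct Weights { float w0, w1, w2, w3; };
Weights bernstein(float t) {
    float u = 1.0f - t;
    return {u * u * u, 3.0f * t * u * u, 3.0f * t * t * u, t * t * t};
}

// Scalar form: x and y are computed with separate expressions.
Vec2 bezierScalar(float t) {
    Weights w = bernstein(t);
    float x = P0.x * w.w0 + P1.x * w.w1 + P2.x * w.w2 + P3.x * w.w3;
    float y = P0.y * w.w0 + P1.y * w.w1 + P2.y * w.w2 + P3.y * w.w3;
    return {x, y};
}

// Vector form: one expression over Vec2 operands.
Vec2 bezierVector(float t) {
    Weights w = bernstein(t);
    return P0 * w.w0 + P1 * w.w1 + P2 * w.w2 + P3 * w.w3;
}
```

Both functions perform the same multiplications and additions per component, which is why a compiler targeting scalar hardware can lower either form to similar code.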
Your best bet is to do benchmarking on all the varieties of Systems (i.e. GPUs) that you believe could be used with this code, and work out which ones are faster with the Vectorized code, and which are faster with the Scalarized code. Then you'd write both versions of the code (or, more likely, the multitude of versions), and write runtime logic to switch which version is being used based on which GPU/drivers are being used.
That, of course, is a huge hassle. Most programmers won't do that; GPGPU programmers usually have only a single server/GPU node type that they work with, so their code will be specifically tailored to only a single architecture. Meanwhile, at AAA Game Studios (which are the only other place which would have the budget and manpower to tackle that kind of task) they usually just let NVidia and AMD sort out that magic on their end, where NVidia/AMD will write better, more optimized versions of the Shaders used by those games, add them to their drivers, and tell the drivers to substitute in the better Shaders instead of whatever Gearbox/Bethesda/whomever tried to load.
The important thing is, for your use case, your best bet is to focus on making the code more maintainable; that will save you way more time, and will make your program run better, than any "premature optimization" will (which, let's be clear, is basically what you're doing).
I'd like to know if someone has experience in writing a HAL AudioUnit rendering callback that takes advantage of multi-core processors and/or symmetric multiprocessing?
My scenario is the following:
A single audio component of sub-type kAudioUnitSubType_HALOutput (together with its rendering callback) takes care of additively synthesizing n sinusoid partials with independent individually varying and live-updated amplitude and phase values. In itself it is a rather straightforward brute-force nested loop method (per partial, per frame, per channel).
However, upon reaching a certain upper limit for the number of partials "n", the processor gets overloaded and starts producing drop-outs, while three other processors remain idle.
Aside from the general discussion about additive synthesis being "processor expensive" compared with, say, "wavetable", I need to know whether this can be resolved the right way, that is, by taking advantage of multiprocessing on a multi-processor or multi-core machine. Breaking the rendering thread into sub-threads does not seem the right way, since the render callback is already a time-constraint thread in itself, and the final output has to be sample-accurate in terms of latency. Has anyone had positive experience with, and valid methods for, resolving such an issue?
System: 10.7.x
CPU: quad-core i7
Thanks in advance,
CA
This is challenging because OS X is not designed for something like this. There is a single audio thread - it's the highest priority thread in the OS, and there's no way to create user threads at this priority (much less get the support of a team of systems engineers who tune it for performance, as with the audio render thread). I don't claim to understand the particulars of your algorithm, but if it's possible to break it up such that some tasks can be performed in parallel on larger blocks of samples (enabling absorption of periods of occasional thread starvation), you certainly could spawn other high priority threads that process in parallel. You'd need to use some kind of lock-free data structure to exchange samples between these threads and the audio thread. Convolution reverbs often do this to allow reasonable latency while still operating on huge block sizes. I'd look into how those are implemented...
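As a rough illustration of the "lock-free data structure to exchange samples" mentioned above, here is a minimal single-producer, single-consumer ring buffer in C++. It is a sketch under stated assumptions, not a Core Audio API: `SampleRing` is a hypothetical class, exactly one worker thread may call `push` and only the audio thread may call `pop`, and a real implementation would move whole blocks of samples rather than one float at a time.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Single-producer / single-consumer ring buffer of float samples.
// Assumption: one worker thread calls push(), one audio thread calls pop().
class SampleRing {
public:
    // One slot is kept empty to distinguish "full" from "empty".
    explicit SampleRing(std::size_t capacity)
        : buf_(capacity + 1), head_(0), tail_(0) {}

    // Producer side. Returns false and writes nothing when the ring is full.
    bool push(float sample) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % buf_.size();
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        buf_[head] = sample;
        head_.store(next, std::memory_order_release);  // publish the sample
        return true;
    }

    // Consumer side. Returns false when empty, so the audio thread never blocks.
    bool pop(float& out) {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[tail];
        tail_.store((tail + 1) % buf_.size(), std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::vector<float> buf_;
    std::atomic<std::size_t> head_;  // next slot to write, owned by the producer
    std::atomic<std::size_t> tail_;  // next slot to read, owned by the consumer
};
```

Neither side takes a lock or allocates after construction. When `pop` returns false the render callback can output silence for that frame instead of waiting, which is where the larger block sizes mentioned above help absorb scheduling jitter.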
Have you looked into the Accelerate.framework? You should be able to improve the efficiency by performing operations on vectors instead of using nested for-loops.
If you have vectors (of length n) for the sinusoidal partials, the amplitude values, and the phase values, you could apply a vDSP_vadd or vDSP_vmul operation, then vDSP_sve.
As far as I know, AU threading is handled by the host. A while back, I tried a few ways to multithread an AU render (GCD, OpenCL, etc.), and they were all either a no-go or unpredictable. There is (or at least was; I have not checked recently) a built-in AU called 'deferred renderer', I believe, and it threads the input and output separately, but I seem to remember that there was latency involved, so that might not help.
Also, If you are testing in AULab, I believe that it is set up specifically to only call on a single thread (I think that is still the case), so you might need to tinker with another test host to see if it still chokes when the load is distributed.
Sorry I couldn't help more, but I thought those few bits of info might be helpful.
Sorry for replying to my own question; I don't know of another way to add relevant information. Edit doesn't seem to work, and a comment is way too short.
First of all, sincere thanks to jtomschroeder for pointing me to the Accelerate.framework.
This would work perfectly for so-called overlap/add resynthesis based on IFFT. Yet I haven't found a key to vectorizing the kind of process I'm using, which is called "oscillator-bank resynthesis" and is notorious for taxing the processor (F.R. Moore: Elements of Computer Music). Each momentary phase and amplitude has to be interpolated "on the fly" and the last value stored into the control struct for further interpolation. Direction of time and time stretch depend on live input. Not all partials exist all the time, and placement of breakpoints is arbitrary and possibly irregular. Of course, my primary concern is organizing data in a way that minimizes the number of math operations...
If someone could point me at an example of positive practice, I'd be very grateful.
// Here's the simplified code snippet:
OSStatus AdditiveRenderProc(
    void *inRefCon,
    AudioUnitRenderActionFlags *ioActionFlags,
    const AudioTimeStamp *inTimeStamp,
    UInt32 inBusNumber,
    UInt32 inNumberFrames,
    AudioBufferList *ioData)
{
    // local variables' declaration and behaviour-setting conditional statements
    // some local variables are here for debugging convenience
    // {... ... ...}
    // Get the time-breakpoint parameters out of the gen struct
    AdditiveGenerator *gen = (AdditiveGenerator*)inRefCon;
    // compute interpolated values for each partial's each frame
    // {deltaf[p]... ampf[p][frame]... ...}
    // here comes the brute-force "processor eater" (single channel only!)
    Float32 *buf = (Float32 *)ioData->mBuffers[channel].mData;
    for (UInt32 frame = 0; frame < inNumberFrames; frame++)
    {
        buf[frame] = 0.;
        for (UInt32 p = 0; p < candidates; p++) {
            if (gen->partialFrequencyf[p] < NYQUISTF)
                buf[frame] += sinf(phasef[p]) * ampf[p][frame];
            phasef[p] += (gen->previousPartialPhaseIncrementf[p] + deltaf[p]*frame);
            if (phasef[p] > TWO_PI) phasef[p] -= TWO_PI;
        }
        buf[frame] *= ovampf[frame];
    }
    for (UInt32 p = 0; p < candidates; p++) {
        // store the updated parameters back to the gen struct
        // {... ... ...}
        ;
    }
    return noErr;
}
Recently I've bumped into the following C++ code:
if (a)
{
f();
}
else if (b)
{
f();
}
else if (c)
{
f();
}
Where a, b and c are all different conditions, and they are not very short.
I tried to change the code to:
if (a || b || c)
{
f();
}
But the author opposed saying that my change will decrease readability of the code. I had two arguments:
1) You should not increase readability by replacing one branching statement with three (though I really doubt that it's possible to make code more readable by using else if instead of ||).
2) It's not the fastest code, and no compiler will optimize this.
But my arguments did not convince him.
What would you tell a programmer writing such a code?
Do you think complex condition is an excuse for using else if instead of OR?
This code is redundant. It is prone to error.
If you were to replace f(); someday with something else, there is the danger you miss one out.
There may, though, be a motivation behind it: these three condition bodies could one day become different, and the author is preparing for that situation. If there is a strong possibility it will happen, it may be okay to do something of the sort. But I'd advise following the YAGNI principle (You Ain't Gonna Need It). I can't say how much bloated code has been written not because of a real need but just in anticipation of it becoming needed tomorrow. Practice shows this does not bring any value during the entire lifetime of an application but heavily increases maintenance overhead.
As to how to approach explaining it to your colleague, it has been discussed numerous times. Look here:
How do you tell someone they’re writing bad code?
How to justify to your colleagues that they produce crappy code?
How do you handle poor quality code from team members?
“Mentor” a senior programmer or colleague without insulting
Replace the three complex conditions with one function, making it obvious why f() should be executed.
bool ShouldExecute() { return a || b || c; }
...
if (ShouldExecute()) { f(); }
Since the conditions are long, have him do this:
if ( (aaaaaaaaaaaaaaaaaaaaaaaaaaaa)
|| (bbbbbbbbbbbbbbbbbbbbbbbbbbbb)
|| (cccccccccccccccccccccccccccc) )
{
f();
}
A good compiler might turn all of these into the same code anyway, but the above is a common construct for this type of thing. 3 calls to the same function is ugly.
In general I think you are right in that if (a || b || c) { f(); } is easier to read. He could also make good use of whitespace to help separate the three blocks.
That said, I would be interested to see what a, b, c, and f look like. If f is just a single function call and each block is large, I can sort of see his point, although I cringe at violating the DRY principle by calling f three different times.
Performance is not an issue here.
Many people wrap themselves in the flag of "readability" when it's really just a matter of individual taste.
Sometimes I do it one way, sometimes the other. What I'm thinking about is -
"Which way will make it easier to edit the kinds of changes that might have to be made in the future?"
Don't sweat the small stuff.
I think that both of your arguments (as well as Developer Art's point about maintainability) are valid, but apparently your discussion partner is not open for a discussion.
I get the feeling that you are having this discussion with someone who is ranked as more senior. If that's the case, you have a war to fight and this is just one small battle, which is not important for you to win. Instead of spending time arguing about this thing, try to make your results (which will be far better than your discussion partner's if he's writing that kind of code) speak for themselves. Just make sure that you get credit for your work, not the whole team or someone else.
This is probably not the kind of answer you expected to the question, but I got a feeling that there's something more to it than just this small argument...
I very much doubt there will be any performance gain from this, except perhaps in one very specific scenario. In this scenario you change a, b, and c, and thus which of the three triggers the code changes, but the code executes anyhow. Reducing the code to one if-statement might then improve things, since the CPU might have the code in the branch cache when it gets to it next time. If you triple the code, so that it occupies 3 times the space in the branch cache, there is a higher chance one or more of the paths will be pushed out, and thus you won't have the most performant execution.
This is very low-level, so again, I doubt this will make much of an impact.
As for readability, which one is easier to read:
if something, do this
if something else, do this
if yet another something else, do this
"this" is the same in all three cases
or this:
if something, or something else, or yet another something else, then do this
Place some more code in there, other than just a simple function call, and it starts getting hard to identify that this is actually three identical pieces of code.
Maintainability goes down with the 3 if-statement organization because of this.
You also have duplicated code, almost always a sign of bad structure and design.
Basically, I would tell the programmer that if he has problems reading the 1 if-statement way of writing it, maybe C++ is not what he should be doing.
Now, let's assume that the "a", "b", and "c" parts are really big, so that the ORs in there get lost in lots of noise from parentheses or whatnot.
I would still reorganize the code so that it only called the function (or executed the code in there) in one place, so perhaps this is a compromise?
bool doExecute = false;
if (a) doExecute = true;
if (b) doExecute = true;
if (c) doExecute = true;
if (doExecute)
{
f();
}
or, even better, this way, which takes advantage of boolean short-circuiting to avoid evaluating things unnecessarily:
bool doExecute = a;
doExecute = doExecute || b;
doExecute = doExecute || c;
if (doExecute)
{
f();
}
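The claim that the else-if chain and the single || condition behave the same can be checked directly. The C++ sketch below instruments both shapes with counters. `Trace`, `check`, `elseIfChain` and `singleOr` are hypothetical names used only for this illustration, and plain bools stand in for the long conditions a, b and c.

```cpp
// Counters used only to observe what actually runs.
struct Trace {
    int conditionsEvaluated = 0;
    int calls = 0;
};

// Wraps a condition so that each evaluation is counted.
bool check(Trace& t, bool value) {
    ++t.conditionsEvaluated;
    return value;
}

// Stand-in for the shared body f().
void f(Trace& t) { ++t.calls; }

// Original shape: three branches with identical bodies.
void elseIfChain(Trace& t, bool a, bool b, bool c) {
    if (check(t, a)) {
        f(t);
    } else if (check(t, b)) {
        f(t);
    } else if (check(t, c)) {
        f(t);
    }
}

// Proposed shape: one condition, one call site.
// || short-circuits, so later conditions are skipped once one is true.
void singleOr(Trace& t, bool a, bool b, bool c) {
    if (check(t, a) || check(t, b) || check(t, c)) {
        f(t);
    }
}
```

For every combination of a, b and c, both functions evaluate the same conditions in the same order and call f() at most once, so the choice between them comes down to readability and maintenance rather than behavior.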
Performance shouldn't really even come into question
Maybe later he won't call f() in all 3 conditions.
Repeating code doesn't make things clearer. Your (a||b||c) is much clearer, although maybe one could refactor it even more (since it's C++), e.g.
x.f()