Performance of different Cg/GLSL/HLSL functions

There are standard libraries of shader functions, such as for Cg. But are there resources which tell you how long each takes... I'm thinking similar to how you used to be able to look up how many cycles each ASM op would take.

There are no reliable resources that will tell you how long various standard shader functions take. Not even for a particular piece of hardware.
The reason for this has to do with instruction scheduling and the way modern shader architectures work. Take a simple sin function. Let's say the hardware has a special unit to compute the sine of a value, so it isn't manually evaluating a Taylor series or something. However, let's also say that it takes a sequence of 4 opcodes to actually compute it. Therefore, sin would take "4 cycles".
However, all of those opcodes are scalar operations. Therefore, while they're going on, you could in fact have some 3-vector dot-products, or on some hardware 4-vector dot-products, going on at the same time on the same processor. Therefore, if the hardware can co-issue 4-vector dot-products with scalar operations, the number of cycles it takes to execute a sin and a matrix-vector multiply is... still 4.
So how much did the sin operation cost? If you take out the matrix multiply, nothing gets faster. If you take out the sin, nothing still gets faster. How much does it cost? You can't say, because the cost of a single operation is irrelevant; the only measurable quantity is the cost of the shader itself.
Ultimately, all you can do is build your shader reasonably and see what the performance is. Unless you have low-level debugging tools that can show you the underlying hardware shader assembly (and no, DX assembly isn't good enough), that's really the best you can do.

Related

How to accurately measure performance of sorting algorithms

I have a bunch of sorting algorithms in C I wish to benchmark. I am concerned regarding good methodology for doing so. Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method. How do I minimize the effect of said variables on the benchmark's results?
To give you a few examples, I've considered multiple implementations in two different languages to adjust for the first two variables. Moreover, I could compile the code with different compilers using fairly mundane (and specified) options. Now, I'm going to be running the tests on my machine, which features turbo boost and whatnot and often boosts a core running heavy work to the moon. Of course I will be disabling that and doing multiple runs and likely taking their mean completion time to adjust for that as well. Regarding the input data, I will be using different array sizes, from very small to relatively large. I do not know what the increments should ideally be, nor what the range of the elements should be. I also presume duplicate elements should be allowed.
I know that theoretical analysis of algorithms accounts for all of these variables, but it is crucial that I complement my study with actual benchmarks. How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected? I'm comfortable with the technologies I'm working with, less so with strict methodology for studying a topic. Thank you.
You can't benchmark abstract algorithms, only specific implementations of them, compiled with specific compilers running on specific machines.
Choose a couple of different relevant compilers and machines (e.g. a Haswell, Ice Lake, and/or Zen 2, and an Apple M1 if you can get your hands on one, and/or an AArch64 cloud server) and measure your real implementations. If you care about in-order CPUs like ARM Cortex-A53, measure on one of those too. (Simulation with gem5 or similar performance simulators might be worth trying. Also possibly relevant are low-power implementations like Intel Silvermont, whose out-of-order window is much smaller, but which also have a shorter pipeline and thus a smaller branch-mispredict penalty.)
If some algorithm allows a useful micro-optimization in the source, or one that a compiler finds on its own, that's a real advantage of that algorithm.
Compile with options you'd use in practice for the use-cases you care about, like clang -O3 -march=native, or just -O2.
Benchmarking on cloud servers makes it hard / impossible to get an idle system, unless you pay a lot for a huge instance, but modern AArch64 servers are relevant and may have different ratios of memory bandwidth vs. branch mispredict costs vs. cache sizes and bandwidths.
(You might well find that the same code is the fastest sorting implementation on all or most of the systems you test on.)
Re: sizes: yes, a variety of sizes would be good.
You'll normally want to test with random data, perhaps always generated from the same PRNG seed so you're sorting the same data every time.
You may also want to test some unusual cases like already-sorted or almost-sorted, because algorithms that are extra fast for those cases are useful.
If you care about sorting things other than integers, you might want to test with structs of different sizes, with an int key as a member. Or a comparison function that does some amount of work, if you want to explore how sorts do with a compare function that isn't as simple as just one compare machine instruction.
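As a concrete illustration of that in C (the struct layout, the 60-byte payload, and the names record and compare_records are arbitrary choices for this sketch, not anything prescribed above), a struct with an int key sorted through a comparison function could look like this:

#include <stdlib.h>

struct record {
    int key;             /* the sort key */
    char payload[60];    /* arbitrary extra bytes, just to change the element size */
};

static int compare_records(const void *a, const void *b) {
    const struct record *ra = a, *rb = b;
    return (ra->key > rb->key) - (ra->key < rb->key);   /* avoids the overflow of ra->key - rb->key */
}

/* Usage with the standard library sort: qsort(records, n, sizeof(struct record), compare_records); */

Making the comparator more or less expensive (or the payload larger) lets you see how much of the sort's time goes into comparing versus moving data.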
As always with microbenchmarking, there are many pitfalls around warm-up of arrays (page faults), CPU frequency, and more; see "Idiomatic way of performance evaluation?" for details.
taking their mean completion time
You might want to discard high outliers, or take the median which will have that effect for you. Usually that means "something happened" during that run to disturb it. If you're running the same code on the same data, often you can expect the same performance. (Randomization of code / stack addresses with page granularity usually doesn't affect branches aliasing each other in predictors or not, or data-cache conflict misses, but tiny changes in one part of the code can change performance of other code via effects like that if you're re-compiling.)
If you're trying to see how it would run when it has the machine to itself, you don't want to consider runs where something else interfered. If you're trying to benchmark under "real world" cloud server conditions, or with other threads doing other work in a real program, that's different and you'd need to come up with realistic other loads that use some amount of shared resources like L3 footprint and memory bandwidth.
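To tie the points above together (seeded random data, a warm-up pass, repeated runs, and taking the median), here is a rough C harness; it assumes a POSIX system for clock_gettime, and my_sort is just a placeholder for whatever implementation you're actually benchmarking:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Placeholder implementation under test; swap in your own sort here. */
static void my_sort(int *a, size_t n) { qsort(a, n, sizeof *a, cmp_int); }

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double run_once(const int *input, int *work, size_t n) {
    struct timespec t0, t1;
    memcpy(work, input, n * sizeof *work);           /* same data every run */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    my_sort(work, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    enum { RUNS = 15 };
    const size_t n = 1u << 20;
    int *input = malloc(n * sizeof *input);
    int *work  = malloc(n * sizeof *work);
    double times[RUNS];
    if (!input || !work) return 1;

    srand(12345);                                    /* fixed seed: reproducible input */
    for (size_t i = 0; i < n; i++)
        input[i] = rand();

    run_once(input, work, n);                        /* warm-up: page faults, caches */
    for (int r = 0; r < RUNS; r++)
        times[r] = run_once(input, work, n);

    qsort(times, RUNS, sizeof times[0], cmp_double); /* sorting the times makes the median easy */
    printf("median of %d runs: %.6f s\n", RUNS, times[RUNS / 2]);
    free(input);
    free(work);
    return 0;
}

Wrap the same harness around different sizes and input patterns (sorted, reversed, many duplicates) to cover the cases discussed above.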
Things that could affect benchmark performance include (but are not limited to): specific coding of the implementation, programming language, compiler (and compiler options), benchmarking machine and critically the input data and time measuring method.
Let's look at this from a very different perspective - how to present information to humans.
With 2 variables you get a nice 2-dimensional grid of results, maybe like this:
        A = 1        A = 2
B = 1   4 seconds    2 seconds
B = 2   6 seconds    3 seconds
This is easy to display and easy for humans to understand and draw conclusions from (e.g. from my silly example table it's trivial to make 2 very different observations - "A=1 is twice as fast as A=2 (regardless of B)" and "B=1 is faster than B=2 (regardless of A)").
With 3 variables you get a 3-dimensional grid of results, and with N variables you get an N-dimensional grid of results. Humans struggle with "3-dimensional data on 2-dimensional screen" and more dimensions becomes a disaster. You can mitigate this a little by "peeling off" a dimension (e.g. instead of trying to present a 3D grid of results you could show multiple 2D grids); but that doesn't help humans much.
Your primary goal is to reduce the number of variables.
To reduce the number of variables:
a) Determine how important each variable is for what you intend to observe (e.g. "which algorithm" will be extremely important and "which language" will be less important).
b) Merge variables based on importance and "logical grouping". For example, you might get three "lower importance" variables (language, compiler, compiler options) and merge them into a single "language+compiler+options" variable.
Note that it's very easy to overlook a variable. For example, you might benchmark "algorithm 1" on one computer and benchmark "algorithm 2" on an almost identical computer, but overlook the fact that (even though both benchmarks used identical languages, compilers, compiler options and CPUs) one computer has faster RAM chips, and overlook "RAM speed" as a possible variable.
Your secondary goal is to reduce the number of values each variable can have.
You don't want massive table/s with 12345678 million rows; and you don't want to spend the rest of your life benchmarking to generate such a large table.
To reduce the number of values each variable can have:
a) Figure out which values matter most
b) Select the right number of values in order of importance (and ignore/skip all other values)
For example, if you merged three "lower importance" variables (language, compiler, compiler options) into a single variable; then you might decide that 2 possibilities ("C compiled by GCC with -O3" and "C++ compiled by MSVC with -Ox") are important enough to worry about (for what you're intending to observe) and all of the other possibilities get ignored.
How do I minimize the effect of said variables on the benchmark's results?
How would you go about resolving the mentioned issues, and adjust for these variables once the data is collected?
By identifying the variables (as part of the primary goal) and explicitly deciding which values the variables may have (as part of the secondary goal).
You've already been doing this. What I've described is a formal method of doing what people would unconsciously/instinctively do anyway. For one example, you have identified that "turbo boost" is a variable, and you've decided that "turbo boost disabled" is the only value for that variable you care about (but do note that this may have consequences - e.g. consider "single-threaded merge sort without the turbo boost it'd likely get in practice" vs. "parallel merge sort that isn't as influenced by turning turbo boost off").
My hope is that by describing the formal method you gain confidence in the unconscious/instinctive decisions you're already making, and realize that you were very much on the right path before you asked the question.

How to test my generated assembly program?

I've made a program which generates assembly instructions, according to its arguments, for my vector extension, in order to perform a convolution. Note that I assume my vector extension doesn't have a loop or branch instruction.
However, if I set input width = 7, kernel width = 3, input channels = 128, output channels = 4, then the number of generated instructions is almost 90,000. I have an instruction simulator for this vector processor, but I can't figure out how to check whether my generated instructions are sane or not.
Is there a good place to start, or any good ideas?
The obvious thing would be to run it with some fully randomized test inputs, and compare against the result of a simple known-good implementation with the same data input. (e.g. written in C or your favourite high-level language, possibly just running on the host CPU, not inside the simulator). A simple implementation running inside your simulator would be good to have as well, or instead if that's easier.
When you compare results, you may need to allow some wiggle room for FP rounding errors if your simple implementation uses a different order of operations. Like a pretty standard thing would be to check that the absolute differences are all within 1e-7 or something, or check relative differences (although relative-error can be large for numbers near zero that resulted from subtraction; catastrophic cancellation is a known problem for FP).
(See also https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ and the rest of Bruce's series of FP articles if you're not already aware of these issues.)
Perhaps worth having a reference implementation that computes in double-precision so you have a better idea what the actual correct answers are, when evaluating a computation with rounding errors.
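As a minimal sketch of that kind of check in C (the tolerances and the names compare_outputs, abs_tol, and rel_tol are just illustrative; tune them to your data), comparing the simulator's float output against a double-precision reference:

#include <math.h>
#include <stdio.h>

/* Count elements where the simulator output differs from the reference by
 * more than a mixed absolute/relative tolerance, printing the first few. */
static int compare_outputs(const float *out, const double *ref, int n,
                           double abs_tol, double rel_tol) {
    int bad = 0;
    for (int i = 0; i < n; i++) {
        double diff = fabs((double)out[i] - ref[i]);
        double allowed = abs_tol + rel_tol * fabs(ref[i]);
        if (diff > allowed) {
            if (bad < 10)
                printf("mismatch at %d: got %g, expected %g (diff %g)\n",
                       i, (double)out[i], ref[i], diff);
            bad++;
        }
    }
    return bad;
}

/* Usage: if (compare_outputs(sim_out, ref_out, n, 1e-6, 1e-5) == 0) puts("OK"); */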
Debugging when data doesn't match the reference:
Test again with very simple input data, like all 0.0 except a 1.0 in one element. That might highlight a wrong array indexing problem. Or all 1.0, or all -2.0.
Or some input that should produce a very simple output, for the known algorithm you're trying to implement. e.g. if most outputs are supposed to be 0.0, seeing which ones aren't, or what value they have, could be a big hint.
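For instance, a tiny helper to build that kind of "impulse" input (the name make_impulse and the channel-major layout are assumptions for this sketch; adjust the indexing to however your data is actually laid out):

#include <string.h>

/* All zeros except a single 1.0f at (channel c, position x). For a
 * convolution, the output should then be a copy of the corresponding
 * kernel slice, which is easy to check by eye. */
static void make_impulse(float *input, int channels, int width, int c, int x) {
    memset(input, 0, (size_t)channels * width * sizeof *input);
    input[(size_t)c * width + x] = 1.0f;
}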
Also note that most real-world CPUs have some kind of instruction cache, so it's usually worth a tiny bit of loop overhead (with a large, partially unrolled loop body) to recycle a loop body that fits in cache, instead of fully unrolling / peeling a loop into one huge block of straight-line code. (90k instructions sounds like too much.) But if there really isn't any simple repetition that could be kept in a loop, or your extension genuinely has no loop or branch instructions, fully unrolled straight-line code may be all you can do; the instruction-fetch cost is still worth keeping in mind.

swap two variables. which way is faster?

Let's say we have two integers a and b. Which way is faster for swapping their values?
c=a;
a=b;
b=c;
or
a=a+b;
b=a-b;
a=a-b;
or bitwise xor
a=a^b;
b=a^b;
a=a^b;
I'll test the performance differences when I'm able to, but I'd like to know now. Is the bitwise version the fastest?
Firstly, you cannot quantify the speed of an algorithm independently of the programming language, the compiler, and the platform on which it runs. An algorithm is a mathematical abstraction.
Having said that:
for a typical programming language,
and a typical compiler, and
a typical execution platform,
the first version will typically be faster because it will typically compile to fewer native instructions that take fewer clock cycles to execute. The first version only requires load and store operations. The other two versions have (at least) the same number of loads and stores, plus some additional arithmetic or bit-manipulation instructions.
However, even that is not cut-and-dried.
The 2nd and 3rd examples are performing the swap without using a temporary variable. This is something you might do if using an extra temporary variable was expensive. This might happen on a machine which didn't provide enough general purpose registers, and the relative cost of loading / saving to memory was large. In some circumstances, the native code equivalents could be optimal.
However ... and this is the real point ... the best strategy is to leave this kind of decision to the compiler. Unless you are prepared to put a huge amount of effort into micro-optimizing, the compiler is likely to do a better job than you can. Indeed, writing code in "cunning" ways is liable to make it harder for the compiler to optimize. (In the 3rd case, for example, the compiler would need to figure out that the sequence is actually swapping 2 variables before it could substitute the optimal instruction sequence. Chances are that the optimizer won't be able to do that.)
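For reference, here are the three variants written out as C functions; this is only a sketch for feeding to a compiler (say, with -O2) so you can inspect the generated assembly yourself. With optimization enabled, a mainstream compiler will typically turn the temporary-variable version into plain register moves, or eliminate the swap entirely by renaming.

#include <stdio.h>

static void swap_tmp(int *a, int *b)   { int c = *a; *a = *b; *b = c; }

/* The arithmetic version can overflow for large values (undefined behaviour for signed int). */
static void swap_arith(int *a, int *b) { *a = *a + *b; *b = *a - *b; *a = *a - *b; }

/* The XOR trick breaks if a and b point to the same object: it zeroes it. */
static void swap_xor(int *a, int *b)   { *a ^= *b; *b ^= *a; *a ^= *b; }

int main(void) {
    int a = 1, b = 2;
    swap_tmp(&a, &b);
    printf("%d %d\n", a, b);   /* prints "2 1" */
    swap_arith(&a, &b);
    swap_xor(&a, &b);
    printf("%d %d\n", a, b);   /* "2 1" again after two more swaps */
    return 0;
}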

Variable storage versus redundant arithmetic

I'm writing a very simple loop in Lua for a LÖVE game I'm developing. I understand I'll waste more time worrying about this than will ever be spent on any CPU clock cycles the answer to this question saves me, but I want a deeper knowledge of how this works.
The current body of the loop is like so:
local low = mid - diff
local high = mid + diff
love.graphics.line(low, 0, low, wheight)
love.graphics.line(high, 0, high, wheight)
I want to know if it will be more computationally efficient to keep it as is or to change it to:
love.graphics.line(mid - diff, 0, mid - diff, wheight)
love.graphics.line(mid + diff, 0, mid + diff, wheight)
With the second body, I have to calculate the low and high differences twice each. With the first, I have to store them in memory and access them twice each.
Which is more efficient?
The short answer is that it's unlikely to make any difference at all. Even if there were any kind of difference, consider the code right next to it: drawing a line. Drawing even an aliased line with a very optimized Bresenham implementation in native code is enormously expensive in comparison to an add and a subtract. Even the function call alone will likely dwarf this cost.
With the second body, I have to calculate the low and high differences twice each. With the first, I have to store them in memory and access them twice each.
This is not necessarily the case. Variables don't necessarily "store to memory" in ways that expressions don't. They can map directly to a register. Likewise, avoiding variables doesn't necessarily avoid memory: expressions are likewise computed and their results stored in registers, whether or not you explicitly assign the intermediate results to variables.
So from a memory standpoint, both versions of your code need to use registers to store intermediate results of a computation.
Memoization doesn't necessarily have that kind of memory overhead when only simple variables are involved, mainly because such types map directly to registers without stack spills. When you start computing whole arrays/tables in advance, doing additional computation can sometimes be faster than memoization if the memoization means more DRAM access (the memory overhead can outweigh the savings). But simple plain-old-data variables like numbers don't have that DRAM overhead; they map directly to registers. In other words, they're often literally free: the compiler will emit the same machine code whether or not you assign the results of your expressions to local variables -- the same number of registers will be required.
Local variables for data types that map directly to general-purpose registers are best thought of as existing only while you're in high-level-coding land. By the time the JIT or interpreter compiles your code into a form the machine understands, they'll disappear and turn into registers regardless of whether you created those variables or not.
Probably the ultimate question, if there is to be any difference, is whether the redundant computation can be eliminated. It would take only the most trivial optimizer to figure out that mid - diff written twice in the exact same statement only needs to be computed once. I'd be surprised if that didn't get optimized away by the time it reaches the IR instruction selection and register allocation stage.
But even if it turned out to be a surprise, and the Lua interpreter were so inefficient that it failed to recognize the completely redundant computation and performed it anyway, you again have code next to it that renders a line (which involves loopy rasterization). Relatively speaking, the redundant arithmetic is practically free. It's not worth sweating such small stuff here, and this is coming from someone obsessed with shaving clock cycles.
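To illustrate the common-subexpression point in C rather than Lua (draw_line here is a hypothetical stand-in for love.graphics.line), an optimizing compiler will typically emit the same code for both of these functions, keeping mid - diff and mid + diff in registers either way:

void draw_line(float x1, float y1, float x2, float y2);   /* hypothetical stand-in */

void with_locals(float mid, float diff, float wheight) {
    float low  = mid - diff;
    float high = mid + diff;
    draw_line(low, 0.0f, low, wheight);
    draw_line(high, 0.0f, high, wheight);
}

void without_locals(float mid, float diff, float wheight) {
    draw_line(mid - diff, 0.0f, mid - diff, wheight);
    draw_line(mid + diff, 0.0f, mid + diff, wheight);
}

A Lua interpreter isn't obliged to do the same, but the principle is: intermediate results live in registers, not "memory" in any costly sense.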

Floating point operations speed

Is floating-point multiplication by 0.0 faster than an average FP multiplication? The same question for adding 0.0 and multiplying by 1.0.
To make the question precise: is it faster on recent Intel CPUs?
No, not on modern hardware. Modern hardware executes all normal double-precision multiplications/additions/subtractions at the same speed (a throughput of one or two per cycle), regardless of the operand values. The possible exceptions are denormalized (subnormal) numbers and special values like +/-zero, +/-infinity, and NaN; if there is any difference at all, it is those cases that take longer.
However, as with all performance related questions, truth is only in measurements. If this is important to you, measure it, then you know what to do.
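If you want to measure it yourself, here is a rough C sketch (assuming a POSIX system for clock_gettime). The volatile accumulator keeps the compiler from optimizing the loop away, and the factors are arbitrary values chosen so the accumulator stays in the normal (non-denormal) range:

#include <stdio.h>
#include <time.h>

/* Time a dependent chain of multiply-adds with a given factor. */
static double time_factor(double factor, long iters) {
    struct timespec t0, t1;
    volatile double sink = 1.5;   /* volatile: forces the loop to really execute */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        sink = sink * factor + 1.5;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    const long N = 100000000;
    printf("factor 0.0   : %.3f s\n", time_factor(0.0, N));
    printf("factor 1.0   : %.3f s\n", time_factor(1.0, N));
    printf("factor 0.9999: %.3f s\n", time_factor(0.9999, N));
    return 0;
}

On recent Intel CPUs you should see essentially no difference between these cases; denormal inputs or results are the kind of thing that can actually be slow.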
