I have code that's shared between different compute shaders, located in different #include files. It ranges from custom data types to utility functions.
I'm wondering whether these functions could become a performance issue as the project grows and more of them need to be called.
Are functions automatically inlined when appropriate?
The Metal shader compiler should flatten all shader code into a single function. You should not need to be concerned about inlining; the more important thing is that your code is constructed to take advantage of parallel processing and coalesced reads and writes.
Back in D3DXMath we had the ability to multiply, add, subtract, and even divide vector types, which were the D3DXVECTOR2, D3DXVECTOR3, and D3DXVECTOR4 structures.
Now in the DirectXMath incarnation we have XMFLOAT2, XMFLOAT3, XMFLOAT4, and XMVECTOR. If I want to do any math operation I must convert from XMFLOAT to XMVECTOR and back; otherwise Visual Studio throws the error "There is no user defined conversion". Why is that? It's a fact that in the newer versions (Windows 8.1, 10) of the DirectX math library vector operations have changed slightly. Am I doing something wrong?
P.S. For matrices there is another question, but right now let's talk only about vectors. These changes are pushing third-party developers to create their own math libraries, and they have done it. :)
This is actually explained in detail in the DirectXMath Programmer's Guide on MSDN:
The XMVECTOR and XMMATRIX types are the work horses for the DirectXMath Library. Every operation consumes or produces data of these types. Working with them is key to using the library. However, since DirectXMath makes use of the SIMD instruction sets, these data types are subject to a number of restrictions. It is critical that you understand these restrictions if you want to make good use of the DirectXMath functions.
You should think of XMVECTOR as a proxy for a SIMD hardware register, and XMMATRIX as a proxy for a logical grouping of four SIMD hardware registers. These types are annotated to indicate they require 16-byte alignment to work correctly. The compiler will automatically place them correctly on the stack when they are used as a local variable, or place them in the data segment when they are used as a global variable. With proper conventions, they can also be passed safely as parameters to a function (see Calling Conventions for details).
Allocations from the heap, however, are more complicated. As such, you need to be careful whenever you use either XMVECTOR or XMMATRIX as a member of a class or structure to be allocated from the heap. On Windows x64, all heap allocations are 16-byte aligned, but for Windows x86, they are only 8-byte aligned. There are options for allocating structures from the heap with 16-byte alignment (see Properly Align Allocations). For C++ programs, you can use operator new/delete/new[]/delete[] overloads (either globally or class-specific) to enforce optimal alignment if desired.
However, often it is easier and more compact to avoid using XMVECTOR or XMMATRIX directly in a class or structure. Instead, make use of the XMFLOAT3, XMFLOAT4, XMFLOAT4X3, XMFLOAT4X4, and so on, as members of your structure. Further, you can use the Vector Loading and Vector Storage functions to move the data efficiently into XMVECTOR or XMMATRIX local variables, perform computations, and store the results. There are also streaming functions (XMVector3TransformStream, XMVector4TransformStream, and so on) that efficiently operate directly on arrays of these data types.
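To make the heap-alignment caveat concrete, here is a minimal C++17 sketch. The Particle struct is invented for illustration; alignas(16) stands in for the alignment an XMVECTOR member would require, and since C++17 plain operator new honours over-alignment automatically (pre-C++17 you needed the new/delete overloads mentioned above):

```cpp
#include <cstdint>

// Hypothetical struct: alignas(16) mirrors the 16-byte alignment an
// XMVECTOR member would require. On x86 Windows, plain heap allocations
// are only 8-byte aligned, which is why this annotation (or an aligned
// allocator) matters for heap-allocated objects.
struct alignas(16) Particle {
    float velocity[4]; // stand-in for an XMVECTOR member
};

bool is_16_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

With a C++17 compiler, `new Particle` yields a pointer for which `is_16_byte_aligned` returns true.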
By design, DirectXMath is encouraging you to write efficient, SIMD-friendly code. Loading or storing a vector is expensive, so you should try to work in a 'stream' model where you load data, work with it in-register a lot, then write the results.
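As a rough sketch of that stream model, here is the same pattern written with raw SSE intrinsics, which is what XMVECTOR wraps on x86/x64. The function name and values are invented for illustration, and DirectXMath itself is not required:

```cpp
#include <immintrin.h> // SSE intrinsics; XMVECTOR is a __m128 on x86/x64

// 'Stream' model: load once, keep the value in a register across several
// operations, store once at the end.
void scale_and_offset(const float in[4], float out[4]) {
    __m128 v = _mm_loadu_ps(in);           // load  (cf. XMLoadFloat4)
    v = _mm_mul_ps(v, _mm_set1_ps(2.0f));  // work in-register...
    v = _mm_add_ps(v, _mm_set1_ps(1.0f));  // ...without touching memory
    _mm_storeu_ps(out, v);                 // store (cf. XMStoreFloat4)
}
```

The expensive steps are the load and the store at the ends; everything in between stays in a SIMD register.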
That said, I totally get that the usage is a little complex for people new to SIMD math or DirectX in general, and is a bit verbose even for professional developers. That's why I also wrote the SimpleMath wrapper for DirectXMath, which makes it work more like the classic math library you are looking for: XNA Game Studio-style Vector2, Vector3, and Matrix classes, with 'C++ magic' covering up all the explicit loads and stores. SimpleMath types interop neatly with DirectXMath, so you can mix and match as you want.
See this blog post and GitHub as well.
DirectXMath is purposely an 'inline' library, meaning that in optimized code you shouldn't be passing variables around much and instead just compute the value inside your larger function. The D3DXMath library in the deprecated D3DX9, D3DX10, and D3DX11 libraries is more old-school: it relies on function-pointer tables and is heavily performance-bound by calling-convention overhead.
These of course represent different engineering trade-offs. D3DXMath was able to do more substitution at runtime of specialized processor code paths, but pays for this flexibility with the calling-convention and indirection overhead. DirectXMath, on the other hand, assumes a SIMD baseline of SSE/SSE2 (or AVX on Xbox One) so you avoid the need for runtime detection or indirection and instead aggressively utilize inlining.
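The two dispatch styles can be caricatured in a few lines of C++. The functions here are invented for illustration; real D3DXMath selected specialized processor paths through tables like this at startup:

```cpp
// D3DXMath-style: indirect call through a function pointer chosen at
// runtime (e.g. an SSE path vs a plain-C fallback), paying call and
// indirection overhead on every use.
static float add_generic(float a, float b) { return a + b; }
static float (*g_add)(float, float) = add_generic; // filled in at startup

// DirectXMath-style: an inline function the compiler folds directly
// into the caller, so there is typically no call at all.
inline float add_inlined(float a, float b) { return a + b; }

float via_table(float a, float b)  { return g_add(a, b); }
float via_inline(float a, float b) { return add_inlined(a, b); }
```

Both compute the same result; the difference is whether the compiler can see through the call at build time.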
Is it possible to do reduction/update or atomic operations in a compute shader on e.g. mat3 or vec3 data types?
Like this scheme:
layout(std430, binding = 0) buffer Result { mat3 A; }; // SSBO accumulator

void main() {
    A += mat3(1.0); // read-modify-write: not atomic across invocations
}
I have tried out to use shader storage buffer objects (SSBO) but it seems like the update is not atomic (at least I get wrong results when I read back the buffer).
Does anyone have an idea to realize this? Maybe creating a tiny 3x3 image2D and store the result by imageAtomicAdd in there?
There are buffer-based atomics in GLES 3.1.
https://www.khronos.org/registry/gles/specs/3.1/es_spec_3.1.pdf
Section 7.7.
Maybe creating a tiny 3x3 image2D and store the result by imageAtomicAdd in there?
Image atomics are not core and require an extension.
Thank you for the links. I forgot to mention that I work with ARM Mali GPUs and as such they do not expose TLP and do not have warps/wave fronts as Nvidia or AMD. That is, I might have to figure out another quick way.
The techniques proposed in the comments for your post (in particular the log(N) divisor approach where you fold the top half of the results down) still work fine on Mali. The technique doesn't rely on warps/wavefronts - as the original poster said, you just need synchronization (e.g. use a barrier() rather than relying on the implicit barrier which wavefronts would give you).
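For reference, the fold scheme itself is tiny. Here is a CPU sketch in C++ (the function name is invented) where the inner loop plays the role of one pass of shader invocations and the comment marks where barrier() would sit in GLSL; a power-of-two input size is assumed:

```cpp
#include <cstddef>
#include <vector>

// Each pass adds the top half of the partial results onto the bottom half,
// halving the active range until one value remains: log2(N) passes total.
float fold_reduce(std::vector<float> partial) {
    for (std::size_t n = partial.size() / 2; n > 0; n /= 2) {
        for (std::size_t i = 0; i < n; ++i)   // one invocation per i in GLSL
            partial[i] += partial[i + n];
        // barrier() goes here in the shader: all invocations must finish
        // this pass before any of them reads the results in the next pass.
    }
    return partial.empty() ? 0.0f : partial[0];
}
```

Because the synchronization is the explicit barrier rather than implicit warp lockstep, the same structure works on Mali.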
I am working on a microcontroller with tight memory constraints. Hence I watch memory consumption.
I have a library with classes that are only visible in the cpp file. These classes do not show up in the header file; they were implemented directly in place. Now I have started to separate declaration from implementation, because I need to expose some of the classes in the header file. However, I noticed that this separation affects the program size: for some of the classes it increases memory consumption, for some it decreases it.
Why is it that separation of definition and implementation affects compiled program size? How might I leverage this to decrease compiled program size?
When a class is only used inside a single translation unit (file) the compiler is free to perform whatever optimisations it likes. It can completely get rid of the v-table, split the class up and turn it into a more procedural structure if this works better. When you export the class outside, the compiler can't make assumptions about who might be using it and so the optimisations it can perform are more limited.
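A minimal sketch of the effect (names invented): with internal linkage the compiler sees every use of the class, so the virtual call below can be devirtualized and inlined; move Scaler to a header and that guarantee disappears.

```cpp
// The anonymous namespace gives Scaler internal linkage: it is invisible
// outside this translation unit, so the compiler may devirtualize, inline,
// or dissolve the class entirely.
namespace {
struct Scaler {
    virtual ~Scaler() = default;
    virtual int apply(int x) const { return 2 * x; }
};
} // namespace

// The only symbol a header would need to declare.
int scale(int x) {
    Scaler s;           // concrete type known here, so the virtual
    return s.apply(x);  // dispatch can be resolved at compile time
}
```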
However, particularly on microcontrollers, there are lots of aggressive post-linker optimisations, such as procedural abstraction, that can be applied to reduce code size in the finished program. Sometimes, if the compiler has optimised the separate modules less due to the situation described above, bigger gains can be achieved at this stage because there are more unoptimised repeated code blocks.
These days extra memory is so cheap it is rarely worth trying to write your program around saving a few bytes. Having a clearer and easier to maintain code base will quickly pay for any BOM savings at the first instance you have to add new features. If you really want to carefully control memory usage then I'd recommend moving to C (or an extremely limited subset of C++) and getting a really good understanding of how your compiler is optimising.
I'm looking for a tool to statically generate a call graph of the Linux kernel (for a given kernel configuration). The generated call graph should be "complete", in the sense that all calls are included, including potential indirect ones which we can assume are only done through the use of function pointers in the case of the Linux kernel.
For instance, this could be done by analyzing the function pointer types: this approach would lead to superfluous edges in the graph, but that's ok for me.
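A toy fragment showing why the type-based approach over-approximates (names invented): a conservative analysis must give the indirect call in apply edges to every function matching op_t, even if only one is ever passed.

```cpp
// Both functions match the pointer type, so a type-based call graph
// records apply -> {inc, dec} regardless of which is actually reachable.
typedef int (*op_t)(int);

int inc(int x) { return x + 1; }
int dec(int x) { return x - 1; }

int apply(op_t f, int x) {
    return f(x); // indirect call: edges to inc and dec (one may be superfluous)
}
```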
ncc seems to implement this idea, however I didn't succeed in making it work on the 3.0 kernel. Any other suggestions?
I'm guessing this approach could also lead to missing edges in cases where function pointer casts are used, so I'd also be interested in knowing whether this is likely in the Linux kernel.
As a side note, there seem to be other tools that are able to do semantic analysis of the source to infer potential pointer values, but AFAICT, none of them are designed to be used on a project such as the Linux kernel.
Any help would be much appreciated.
We've done global points-to analysis (with indirect function pointers) and full call graph construction of monolithic C systems of 26 million lines (18,000 compilation units).
We did it using our DMS Software Reengineering Toolkit, its C Front End, and its associated flow analysis machinery. The points-to analysis (like the other analyses) is conservative; yes, you get some bogus points-to edges and therefore call edges as a consequence. These are pretty hard to avoid.
You can help such analyzers by providing certain crucial facts about key functions, and by harnessing knowledge such as "embedded systems [and OSes] tend not to have cycles in the call graph", which means you can eliminate some of these. Of course, you have to allow for exceptions; my moral: "in big systems, everything happens."
The particular problem included dynamically loaded(!) C modules using a special loading scheme specific to this particular software, but that just added to the problem.
Casts on function pointers shouldn't lose edges; a conservative analysis should simply assume that the cast pointer matches any function in the system whose signature corresponds to the casted result. More problematic are casts which produce sort-of-compatible signatures: if you cast a function pointer to void* foo(uint) when the actual function being called accepts an int, the points-to analysis will necessarily, conservatively, choose the wrong functions. You can't blame the analyzer for that; the cast lies in that case. Yes, we saw this kind of trash in the 26 million line system.
This is certainly the right scale for analyzing Linux (which I think is a mere 8 million lines or so :-). But we haven't tried it specifically on Linux.
Setting up this tool is complicated because you have to capture all the details about the compilations themselves, and in particular the configuration of the Linux kernel you want to generate. So you pretty much have to intercept the compiler calls to get the command line switches, etc.
I'm wondering about NVIDIA's cuBLAS library. Does anybody have experience with it? For example, if I write a C program using BLAS, will I be able to replace the calls to BLAS with calls to cuBLAS? Or even better, implement a mechanism which lets the user choose at runtime?
What about if I use the BLAS Library provided by Boost with C++?
The answer by janneb is incorrect: cuBLAS is not a drop-in replacement for a CPU BLAS. It assumes the data is already on the device, and the function signatures have an extra parameter to keep track of a cuBLAS context.
However, coming in CUDA 6.0 is a new library called NVBLAS which provides exactly this "drop-in" functionality. It intercepts Level 3 BLAS calls (GEMM, TRSM, etc.) and automatically sends them to the GPU, effectively overlapping the PCIe transfers with on-GPU computation.
There is some information here: https://developer.nvidia.com/cublasxt, and CUDA 6.0 is available to CUDA registered developers today.
Full docs will be online once CUDA 6.0 is released to the general public.
CUBLAS does not wrap around BLAS.
CUBLAS also accesses matrices in column-major ordering, as Fortran codes and BLAS do.
I am more used to writing code in C, even for CUDA.
Code written with CBLAS (a C wrapper for BLAS) can easily be changed into CUDA code.
Be aware that Fortran codes that use BLAS are quite different from C/C++ codes that use CBLAS: Fortran and BLAS normally store matrices (or double arrays) in column-major ordering, but C/C++ normally use row-major ordering.
I normally handle this problem by saving the matrices in 1D arrays and using #define to write macros that access the element (i,j) of a matrix:
/* define macro to access Aij in the row-wise array A[M*N] */
#define indrow(ii,jj,N) (((ii)-1)*(N)+(jj)-1) /* does not depend on the number of rows M */
/* define macro to access Aij in the col-wise array A[M*N] */
#define indcol(ii,jj,M) (((jj)-1)*(M)+(ii)-1)
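A quick sanity check of those macros (the matrix values are invented): the same logical 2x3 matrix stored row-wise and column-wise, with both macros recovering the same element. Note the macros take 1-based (i,j) indices, matching Fortran/BLAS conventions.

```cpp
#define indrow(ii,jj,N) (((ii)-1)*(N)+(jj)-1) /* row-major, N columns */
#define indcol(ii,jj,M) (((jj)-1)*(M)+(ii)-1) /* column-major, M rows */

// A = | 1 2 3 |   (M = 2 rows, N = 3 columns)
//     | 4 5 6 |
const double A_row[6] = {1, 2, 3, 4, 5, 6}; // rows laid out one after another
const double A_col[6] = {1, 4, 2, 5, 3, 6}; // columns laid out one after another

double elem_row(int i, int j) { return A_row[indrow(i, j, 3)]; }
double elem_col(int i, int j) { return A_col[indcol(i, j, 2)]; }
```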
The CBLAS library has well-organized parameters and conventions (const enum variables) for telling each function the ordering of the matrix.
Beware that the storage of matrices also varies: a row-wise banded matrix is not stored the same way as a column-wise banded matrix.
I don't think there is a mechanism to let the user choose between BLAS and CUBLAS without writing the code twice. CUBLAS also has, on most function calls, a "handle" variable that does not appear in BLAS. I thought of using #define to change the name at each function call, but this might not work.
I've been porting BLAS code to CUBLAS. The BLAS library I use is ATLAS, so what I say may be correct only up to choice of BLAS library.
ATLAS BLAS requires you to specify if you are using Column major ordering or row major ordering, and I chose column major ordering since I was using CLAPACK which uses column major ordering. LAPACKE on the other hand would use row major ordering. CUBLAS is column major ordering. You may need to adjust accordingly.
Even if ordering is not an issue, porting to CUBLAS was by no means a drop-in replacement. The largest issue is that you must move the data onto and off of the GPU's memory space. That memory is set up using cudaMalloc() and released with cudaFree(), which act as one might expect. You move data into GPU memory using cudaMemcpy(). The time to do this will be a large determining factor in whether it's worthwhile to move from CPU to GPU.
Once that's done however, the calls are fairly similar. CblasNoTrans becomes CUBLAS_OP_N and CblasTrans becomes CUBLAS_OP_T. If your BLAS library (as ATLAS does) allows you to pass scalars by value you will have to convert that to pass by reference (as is normal for FORTRAN).
Given this, any switch that allows for a choice of CPU/GPU would most easily be at a higher level than within the function using BLAS. In my case I have CPU and GPU variants of the algorithm and chose them at a higher level depending on the size of the problem.
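That higher-level switch can be as simple as a size threshold. This is only a sketch: the names and the 512 cutoff are invented, and a real version would tune the threshold by measuring the PCIe transfer cost against the GPU speed-up, calling the CBLAS or CUBLAS variant inside each branch.

```cpp
#include <cstddef>

enum class Backend { CPU, GPU };

// Pick where to run a matrix operation of dimension n: below the threshold
// the cudaMemcpy traffic dominates, so the CPU (BLAS) path wins; above it
// the GPU (CUBLAS) computation amortizes the transfers.
Backend pick_backend(std::size_t n, std::size_t threshold = 512) {
    return (n < threshold) ? Backend::CPU : Backend::GPU;
}
```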