BLAS and CUBLAS - boost

I'm wondering about NVIDIA's cuBLAS library. Does anybody have experience with it? For example, if I write a C program using BLAS, will I be able to replace the calls to BLAS with calls to cuBLAS? Or, even better, implement a mechanism that lets the user choose at runtime?
What about if I use the BLAS Library provided by Boost with C++?

The answer by janneb is incorrect: cuBLAS is not a drop-in replacement for a CPU BLAS. It assumes data is already on the device, and the function signatures have an extra parameter to keep track of a cuBLAS context.
However, coming in CUDA 6.0 is a new library called NVBLAS which provides exactly this "drop-in" functionality. It intercepts Level 3 BLAS calls (GEMM, TRSM, etc.) and automatically sends them to the GPU, tiling the matrices so that PCIe transfers overlap with on-GPU computation.
There is some information here: https://developer.nvidia.com/cublasxt, and CUDA 6.0 is available to CUDA registered developers today.
Full docs will be online once CUDA 6.0 is released to the general public.

CUBLAS does not wrap around BLAS.
CUBLAS also accesses matrices in column-major ordering, like Fortran code and BLAS.
I am more used to writing code in C, even for CUDA.
Code written with CBLAS (which is a C wrapper for BLAS) can easily be changed into CUDA code.
Be aware that Fortran codes that use BLAS are quite different from C/C++ codes that use CBLAS.
Fortran and BLAS normally store matrices (or 2D double arrays) in column-major ordering,
but C/C++ normally use row-major ordering.
I normally handle this problem by storing the matrices in 1D arrays,
and using #define to write macros that access element (i,j) of a matrix:
/* macro to access A(i,j) in the row-major array A[M*N] */
#define indrow(ii,jj,N) (((ii)-1)*(N)+(jj)-1) /* does not depend on the number of rows M */
/* macro to access A(i,j) in the column-major array A[M*N] */
#define indcol(ii,jj,M) (((jj)-1)*(M)+(ii)-1)
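For example, here is a minimal self-contained sketch of how the row-major macro combines with a CBLAS call (my own illustration; cblas_dscal is just a stand-in for whichever routine you actually need):

#include <cblas.h>
#include <stdio.h>

#define indrow(ii,jj,N) (((ii)-1)*(N)+(jj)-1)  /* same macro as above */

int main(void)
{
    double A[2*3];
    /* fill a 2x3 row-major matrix: A(i,j) = 10*i + j */
    for (int i = 1; i <= 2; ++i)
        for (int j = 1; j <= 3; ++j)
            A[indrow(i,j,3)] = 10.0*i + j;
    cblas_dscal(2*3, 2.0, A, 1);                /* A := 2*A, works for either ordering */
    printf("A(2,3) = %g\n", A[indrow(2,3,3)]);  /* prints 46 */
    return 0;
}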
The CBLAS library has a well-organized set of parameters and conventions (const enum variables)
for telling each function the ordering of the matrix.
Beware that the storage scheme can also vary: a row-major banded matrix is not stored the same way as a column-major banded matrix.
I don't think there is a mechanism that lets the user choose between BLAS and CUBLAS
without writing the code twice.
CUBLAS also takes, in most function calls, a "handle" variable that does not appear in BLAS.
I thought of using #define to change the name at each function call, but this might not work.

I've been porting BLAS code to CUBLAS. The BLAS library I use is ATLAS, so some of what I say may apply only to that particular BLAS library.
ATLAS BLAS requires you to specify whether you are using column-major or row-major ordering, and I chose column-major ordering since I was using CLAPACK, which uses column-major ordering. LAPACKE, on the other hand, would use row-major ordering. CUBLAS uses column-major ordering, so you may need to adjust accordingly.
Even when ordering is not an issue, porting to CUBLAS is by no means a drop-in replacement. The largest issue is that you must move the data onto and off of the GPU's memory space. That memory is set up using cudaMalloc() and released with cudaFree(), which act as one might expect. You move data into GPU memory using cudaMemcpy(). The time this takes is a large factor in deciding whether it's worthwhile to move from CPU to GPU.
Once that's done, however, the calls are fairly similar. CblasNoTrans becomes CUBLAS_OP_N and CblasTrans becomes CUBLAS_OP_T. If your BLAS library allows you to pass scalars by value (as ATLAS does), you will have to convert those calls to pass by reference (as is normal for FORTRAN).
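To make the mapping concrete, here is a hedged sketch of one DGEMM call before and after such a port (my own illustration; error checking is omitted, gemm_cpu/gemm_gpu are made-up wrapper names, and the cublasHandle_t is assumed to have been created earlier with cublasCreate()):

#include <cblas.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* gemm_cpu / gemm_gpu are illustrative wrapper names, not part of any library */

/* CPU version: C = alpha*A*B + beta*C, column-major, all pointers on the host */
void gemm_cpu(int m, int n, int k, double alpha, const double *A,
              const double *B, double beta, double *C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, m, B, k, beta, C, m);
}

/* GPU version: same math, but the data must first be moved into device memory,
   the scalars are passed by reference, and a cublasHandle_t is required */
void gemm_gpu(cublasHandle_t h, int m, int n, int k, double alpha,
              const double *A, const double *B, double beta, double *C)
{
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(double)*m*k);
    cudaMalloc((void**)&dB, sizeof(double)*k*n);
    cudaMalloc((void**)&dC, sizeof(double)*m*n);
    cudaMemcpy(dA, A, sizeof(double)*m*k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(double)*k*n, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(double)*m*n, cudaMemcpyHostToDevice);
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,        /* CblasNoTrans -> CUBLAS_OP_N */
                m, n, k, &alpha, dA, m, dB, k,      /* scalars now passed by reference */
                &beta, dC, m);
    cudaMemcpy(C, dC, sizeof(double)*m*n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}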
Given this, any switch that allows for a choice of CPU/GPU is most easily made at a higher level than within the function using BLAS. In my case I have CPU and GPU variants of the algorithm and choose between them at a higher level depending on the size of the problem.
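A rough sketch of such a higher-level switch, reusing the two hypothetical wrappers above (the threshold is arbitrary and would need tuning for your own hardware and problem):

/* gemm_auto is another made-up name: pick the variant based on problem size */
void gemm_auto(cublasHandle_t h, int m, int n, int k, double alpha,
               const double *A, const double *B, double beta, double *C)
{
    const long flops_threshold = 1L << 27;               /* arbitrary cut-over point */
    if ((long)m * n * k < flops_threshold)
        gemm_cpu(m, n, k, alpha, A, B, beta, C);          /* small: not worth the PCIe copies */
    else
        gemm_gpu(h, m, n, k, alpha, A, B, beta, C);       /* large: GPU wins despite transfers */
}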

Related

Eigen matrix template library over ARM CMSIS-DSP?

I need to do a bunch of matrix operations on an MCU (ARM Cortex-M). ARM provides a set of basic matrix operations in CMSIS-DSP, in varieties optimized for the instruction sets available on different MCUs, but not including some of the advanced operations I need. CMSIS-DSP is great because you can get much better performance on more advanced MCUs without code changes (I think). Eigen looks great, but...
I'm wondering if Eigen can be used on top of CMSIS-DSP, by providing wrapper classes for CMSIS-DSP matrices with interfaces required by Eigen.
Has anyone done this? Good or Bad idea?
Relatedly, can Eigen use only static or stack allocation (no heap), and ideally no exceptions? (A sketch of what I mean is below.)
Alternatively, is there another matrix library more suited to MCUs? Simunova seems to have disappeared, though there's a copy of MTL4 on GitHub. Any pointers appreciated!
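To make the allocation question concrete, here is a sketch of what I am hoping stays entirely heap-free (my understanding is that fixed-size Eigen types keep their storage inline and that defining EIGEN_NO_MALLOC turns any hidden heap allocation into an assertion failure, but please correct me if that's wrong):

// hedged sketch: fixed-size Eigen types live on the stack (or inside the enclosing object);
// EIGEN_NO_MALLOC should make any accidental heap use assert at runtime
#define EIGEN_NO_MALLOC
#include <Eigen/Dense>

int main()
{
    Eigen::Matrix<float, 3, 3> A = Eigen::Matrix<float, 3, 3>::Identity();
    Eigen::Matrix<float, 3, 1> x(1.0f, 2.0f, 3.0f);
    Eigen::Matrix<float, 3, 1> y = A * x;     // fixed-size product, no dynamic allocation
    return y(2) == 3.0f ? 0 : 1;
}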

ARM softfp vs hardfp performance

I have an ARM-based platform with a Linux OS. Even though its gcc-based toolchain supports both hardfp and softfp, the vendor recommends using softfp, and the platform is shipped with a set of standard and platform-related libraries which exist only in softfp versions.
I'm writing computation-intensive (NEON) AI code based on OpenCV and TensorFlow Lite. Following the vendor guide, I have built these with the softfp option. However, I have a feeling that my code underperforms compared to other, somewhat similar hardfp platforms.
Does the code performance depend on the softfp/hardfp setting? Do I understand it right that all .o and .a files the compiler makes to build my program also use the softfp convention, which is less effective? If so, are there any tricky ways to use the hardfp calling convention internally but softfp for external libraries?
Normally, all objects that are linked together need to have the same float ABI. So if you need to use this softfp-only library, I'm afraid you have to compile your own software with softfp too.
I had the same question about mixing ABIs. See here
Regarding the performance: the performance lost with softfp compared to hardfp comes from passing floating-point function parameters through the usual core registers instead of FPU registers. This requires some additional copies between registers. As old_timer said, it is impossible to evaluate the performance loss in general. If you have a single huge function with many float operations, the performance will be the same. If you have many small function calls with many floating-point arguments and few operations, the performance will be dramatically slower.
The softfp option only affects the parameter passing.
In other words, unless you are passing lots of float type arguments while calling functions, there won't be any measurable performance hit compared to hardfp.
And since well-designed projects rely heavily on passing pointers to structures instead of many single values, I would stick to softfp.
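A small hedged illustration of the point (my own sketch, with made-up function names; assuming the calls are not inlined, the separate float arguments travel through core registers under softfp and get copied into VFP registers inside the callee, while the struct version passes a single pointer either way):

#include <stdio.h>

typedef struct { float x, y, z; } vec3;

/* four float arguments: placed in r0-r3 under softfp, in s0-s3 under hardfp */
static float scale3(float x, float y, float z, float s)
{
    return (x + y + z) * s;
}

/* one pointer plus one float: almost no per-call register shuffling to save */
static float scale3_ptr(const vec3 *v, float s)
{
    return (v->x + v->y + v->z) * s;
}

int main(void)
{
    vec3 v = { 1.0f, 2.0f, 3.0f };
    printf("%f %f\n", scale3(v.x, v.y, v.z, 2.0f), scale3_ptr(&v, 2.0f));
    return 0;
}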

Working with DirectXMath and D3DXMath

Back in D3DXMath we had the ability to multiply, add, subtract, and even divide vector types, which were the D3DXVECTOR2, D3DXVECTOR3, and D3DXVECTOR4 structures.
Now, in the DirectXMath incarnation, we have XMFLOAT2, XMFLOAT3, XMFLOAT4, and XMVECTOR. If I want to do any math operation I must convert from XMFLOAT to XMVECTOR; otherwise Visual Studio throws the error "There is no user defined conversion". Why is that? It is a fact that in the newer versions (Windows 8.1, 10) of the DirectX math library, vector operations have changed slightly. Am I doing something wrong?
P.S. For matrices there is another question, but right now let's talk only about vectors. These changes are pushing third-party developers to create their own math libraries, and they have done it. :)
This is actually explained in detail in the DirectXMath Programmer's Guide on MSDN:
The XMVECTOR and XMMATRIX types are the work horses for the DirectXMath Library. Every operation consumes or produces data of these types. Working with them is key to using the library. However, since DirectXMath makes use of the SIMD instruction sets, these data types are subject to a number of restrictions. It is critical that you understand these restrictions if you want to make good use of the DirectXMath functions.
You should think of XMVECTOR as a proxy for a SIMD hardware register, and XMMATRIX as a proxy for a logical grouping of four SIMD hardware registers. These types are annotated to indicate they require 16-byte alignment to work correctly. The compiler will automatically place them correctly on the stack when they are used as a local variable, or place them in the data segment when they are used as a global variable. With proper conventions, they can also be passed safely as parameters to a function (see Calling Conventions for details).
Allocations from the heap, however, are more complicated. As such, you need to be careful whenever you use either XMVECTOR or XMMATRIX as a member of a class or structure to be allocated from the heap. On Windows x64, all heap allocations are 16-byte aligned, but for Windows x86, they are only 8-byte aligned. There are options for allocating structures from the heap with 16-byte alignment (see Properly Align Allocations). For C++ programs, you can use operator new/delete/new[]/delete[] overloads (either globally or class-specific) to enforce optimal alignment if desired.
However, often it is easier and more compact to avoid using XMVECTOR or XMMATRIX directly in a class or structure. Instead, make use of the XMFLOAT3, XMFLOAT4, XMFLOAT4X3, XMFLOAT4X4, and so on, as members of your structure. Further, you can use the Vector Loading and Vector Storage functions to move the data efficiently into XMVECTOR or XMMATRIX local variables, perform computations, and store the results. There are also streaming functions (XMVector3TransformStream, XMVector4TransformStream, and so on) that efficiently operate directly on arrays of these data types.
By design, DirectXMath is encouraging you to write efficient, SIMD-friendly code. Loading or storing a vector is expensive, so you should try to work in a 'stream' model where you load data, work with it in-register a lot, then write the results.
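For example, a minimal sketch of that load / work-in-registers / store pattern (my own illustration with placeholder names, not code from the Programmer's Guide):

#include <DirectXMath.h>
using namespace DirectX;

XMFLOAT3 position(1.0f, 2.0f, 3.0f);   // storage types: fine as members or globals
XMFLOAT3 offset(0.5f, 0.0f, -1.0f);

void Update()                           // placeholder function name
{
    XMVECTOR p = XMLoadFloat3(&position);   // load once
    XMVECTOR o = XMLoadFloat3(&offset);
    p = XMVectorAdd(p, o);                  // work in SIMD registers
    p = XMVector3Normalize(p);
    XMStoreFloat3(&position, p);            // store once
}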
That said, I totally get that the usage is a little complex for people new to SIMD math or DirectX in general, and is a bit verbose even for professional developers. That's why I also wrote the SimpleMath wrapper for DirectXMath, which makes it work more like the classic math library you are looking for, with XNA Game Studio-style Vector2, Vector3, and Matrix classes and 'C++ magic' covering up all the explicit loads and stores. SimpleMath types interop neatly with DirectXMath, so you can mix and match as you want.
See this blog post and GitHub as well.
DirectXMath is purposely an 'inline' library, meaning that in optimized code you shouldn't be passing variables around much; instead, you just compute the value inside your larger function. The D3DXMath library in the deprecated D3DX9, D3DX10, and D3DX11 libraries is more old-school: it relies on function-pointer tables and is heavily bound by calling-convention overhead.
These of course represent different engineering trade-offs. D3DXMath was able to do more substitution at runtime of specialized processor code paths, but pays for this flexibility with the calling-convention and indirection overhead. DirectXMath, on the other hand, assumes a SIMD baseline of SSE/SSE2 (or AVX on Xbox One) so you avoid the need for runtime detection or indirection and instead aggressively utilize inlining.

Segmentation fault with automatic arrays [duplicate]

I have some Fortran code that calls RESHAPE to reorder a matrix such that the dimension that I am now about to loop over becomes the first varying dimension (Column-major order in Fortran).
This has nothing to do with C/Fortran interoperability.
Now the matrix is rather large and when I call the RESHAPE function I get a seg fault which I am very confident is a stack overflow. I know this because I can compile my code in ifort with -heap-arrays and the problem disappears.
I do not want to modify the stack-size. This code needs to be portable for any computer without the user having to concern himself with stack-size.
Is there some way I can get this call to the RESHAPE function to use the heap and not the stack for its internal memory use?
Worst case I will have to 'roll my own' RESHAPE function for this instance but I wish there was a better way.
The Fortran standard does not speak about stack and heap at all, that is an implementation detail. In which part of memory something is placed and whether there are any limits is implementation defined.
Therefore it is impossible to control the stack or heap behaviour from the Fortran code itself. The compiler must be instructed by other means if you want to specify this, and compiler options are used for that. Intel Fortran uses the stack by default and has the -heap-arrays n option (n is the size limit in kB); gfortran is slightly different and has the opposite -fstack-arrays option (included in -Ofast, but it can be disabled).
This is valid for all kinds of temporaries and automatic arrays.

Alternative for dynamic parallelism for CUDA

I am very new to the CUDA programming model, and to programming in general, I suppose. I'm attempting to parallelize an expectation-maximization algorithm. I am working on a GTX 480, which has compute capability 2.0. At first, I sort of assumed that there's no reason for the device to launch its own threads, but of course, I was sadly mistaken. I came across this pdf.
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too many specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive feature. I'm not sure if my question makes sense, so please ask if you need clarification.
Thank you!
As indicated by @JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e. a __global__ function) from device code (a __global__ or __device__ function).
Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism in a device that does not support it (i.e. pre-cc 3.5 devices).
Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.
The principal method would be to identify some unit of your code as parallelizable, convert it to a kernel, and work through your code in an essentially non-nested fashion. Repetitive functions might be handled by looping in the kernel, or else by looping in the host code that calls the kernel.
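A hedged sketch of that host-loop pattern (my own illustration; step_kernel and its halving update are placeholders, not anything from an actual EM implementation):

// placeholder kernel: one flattened, data-parallel update per element
__global__ void step_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 0.5f;              // stand-in for one E or M update
}

// the outer, formerly nested loop runs on the host and launches a flat kernel per step
void run_host_loop(float *d_data, int n, int num_iterations)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int it = 0; it < num_iterations; ++it) {
        step_kernel<<<blocks, threads>>>(d_data, n);   // one launch per outer iteration
        cudaDeviceSynchronize();                       // host-side sync replaces nesting
    }
}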
For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck which introduces some of the new features of CUDA 5 including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same function realized without dynamic parallelism.
I have checked your algorithm in Wikipedia and I'm not sure you need dynamic parallelism at all.
You do the expectation step in your kernel, __syncthreads(), do the maximization step, and __syncthreads() again. From this distance, the expectation looks like a reduction primitive, and the maximization is a filter one.
If it doesn't work, and you need real task parallelism, a GPU may not be the best choice. While the Kepler GPUs can do that to some degree, this is not what this architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also check OpenMP and MPI, these are the languages used for task-parallel programming (actually OpenMP is just a handful of pragmas in most cases).
