I am looking at the source of the Eigen library and saw a function named gemm_pack_rhs. Does anybody know what this function does? I also saw a lot of places mentioning rhs. What does it mean in the Eigen library?
To complete patatahooligan's answer, the gemm_pack_rhs function is used internally within matrix-matrix products to copy blocks of the right-hand side (the "rhs" in the name) into a special memory layout suited to efficient SIMD computation. There is an analogous function for the left-hand side, gemm_pack_lhs.
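To make the idea of packing concrete, here is a minimal sketch in C. It is not Eigen's code and does not reproduce Eigen's actual packed layout; the function name pack_rhs_panel and its parameters are hypothetical. It only shows the general technique: copying a panel of a column-major right-hand side into a contiguous buffer, so that a kernel can read it linearly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Eigen's API: copy a panel of `ncols` columns,
 * starting at column `col0`, from a column-major rhs (leading dimension
 * `ldb`, `depth` rows) into `packed`, so that the entries of each row of
 * the panel end up adjacent in memory. */
static void pack_rhs_panel(const double *rhs, size_t ldb, size_t depth,
                           size_t col0, size_t ncols, double *packed)
{
    assert(rhs != NULL && packed != NULL);
    size_t out = 0;
    for (size_t k = 0; k < depth; ++k)        /* one row of the panel */
        for (size_t j = 0; j < ncols; ++j)    /* its ncols entries */
            packed[out++] = rhs[(col0 + j) * ldb + k];
}
```

After packing, the inner loop of a product kernel walks `packed` with unit stride instead of jumping by `ldb` between columns, which is what makes vector loads cheap.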
I need to do a bunch of matrix operations on an MCU (ARM Cortex M). ARM provides a set of basic matrix operations in CMSIS-DSP, in varieties optimized for instruction sets available on different MCUs, but not including some of the advanced operations I need. CMSIS-DSP is great because you can get much better performance on more advanced MCUs without code change (I think). Eigen looks great, but...
I'm wondering if Eigen can be used on top of CMSIS-DSP, by providing wrapper classes for CMSIS-DSP matrices with interfaces required by Eigen.
Has anyone done this? Good or Bad idea?
Related, can Eigen use only static or stack allocation (no heap) and ideally no exceptions?
Alternatively, is there another matrix library more suited to MCUs? Simunova seems to have disappeared, though there's a copy of MTL4 on GitHub. Any pointers appreciated!
My goal is to run a simulation that requires non-integral numbers across different machines that might have varying CPU architectures and OSes. The main priority is that, given the same initial state, each machine should reproduce the simulation exactly. A secondary priority is that I'd like the calculations to have performance and precision as close as realistically possible to double-precision floats.
As far as I can tell, there doesn't seem to be any way to affect the determinism of floating-point calculations from within a Haskell program, similar to the _controlfp function and the _FPU_SETCW macro in C. So, at the moment, I consider my options to be:
Use Data.Ratio
Use Data.Fixed
Use Data.Fixed.Binary from the fixed-point package
Write a module to call _controlfp (or the equivalent for each platform) via FFI.
Possibly, something else?
One problem with the fixed point arithmetic libraries is that they don't have e.g. trigonometric functions or logarithms defined for them (as they don't implement the Floating type-class) so I guess I would need to provide lookup tables for all the functions in the simulation seed data. Or is there some better way?
Both of the fixed point libraries also hide the newtype constructor, so any (de-)serialization would need to be done via toRational/fromRational as far as I can tell, and that feels like it would add unnecessary overhead.
My next step is to benchmark the different fixed-point solutions to see the real world performance, but meanwhile, I'd gladly take any advice you have on this subject.
Clause 11 of the IEEE 754-2008 standard describes what is needed for reproducible floating-point results. Among other things, you need unambiguous expression evaluation rules. Some languages permit floating-point expressions to be evaluated with extra precision or permit some alterations of expressions (such as evaluating a*b+c in a single instruction instead of separate multiply and add instructions). I do not know about Haskell’s semantics. If Haskell does not precisely map expressions to definite floating-point operations, then it cannot support reproducible floating-point results.
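As an illustration of why the evaluation rules matter, here is a small C sketch (C rather than Haskell, since the answer is language-agnostic). It shows that regrouping the same three operands changes the result in double precision. The function names are made up for this example, and the volatile intermediates are only there to force each partial sum to be rounded to a double.

```c
/* Sum left to right: (a + b) + c. */
static double sum_left(double a, double b, double c)
{
    volatile double t = a + b;   /* volatile: round the intermediate to double */
    return t + c;
}

/* Same operands, regrouped: a + (b + c). */
static double sum_right(double a, double b, double c)
{
    volatile double t = b + c;   /* for b = -1e16, c = 1.0 the 1.0 is rounded away */
    return a + t;
}
```

With a = 1e16, b = -1e16 and c = 1.0, sum_left returns 1.0 while sum_right returns 0.0. A compiler or runtime that is free to reassociate, fuse, or widen such expressions can therefore change simulation results from one machine to the next.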
Also, since you mention trigonometric and logarithmic functions, be aware that these vary from implementation to implementation. I am not aware of any math library that provides correctly rounded implementations of every standard math function. (CRLibm is a project to create one.) So each math library uses its own approximations, and their results vary slightly. Perhaps you might work around this by including a math library with your simulation code, so that it is used instead of each Haskell implementation’s default library.
Routines that convert between binary floating-point and decimal numerals are also a source of differences between implementations. This is less of a problem than it used to be because algorithms for converting correctly are known. However, it is something that might need to be checked in each implementation.
Would an ODE solver written in C, perhaps using the GSL library, have significant speed advantages compared with Mathematica 8.0 NDSolve? How would it fare in terms of accuracy?
My understanding is that compiled code could in principle be faster, but that these days NDSolve uses a lot of compiled code itself already somehow?
Also, are there any options for using things like MathLink or Mathematica's Compile function to speed up solving an ODE?
NDSolve and other numerical functions in Mathematica automatically compile your operand (e.g. the RHS of an ODE) to an intermediate "bytecode" language (the same one used by the Compile function). If you like you can specify CompilationTarget -> "C" and the function will be compiled all the way to C code and linked back in to Mathematica... You can see the generated C code yourself in this previous question on the Mathematica Stack Exchange:
https://mathematica.stackexchange.com/questions/821/how-well-does-mathematica-code-exported-to-c-compare-to-code-directly-written-fo/830#830
Of course, it's always possible in principle to hand-write a faster algorithm... But there are a lot of things to optimize that Mathematica will do automatically. You probably don't want to be responsible for manually optimizing the computation of a sparse matrix of partial derivatives in an optimization problem for example.
Mathematica's focus is on usability. They do use numerical libraries, so the speed would be the same as the best available library or worse (in almost all cases). For example, I heard they use Eigen for matrix stuff.
The other thing you should consider is that although they optimize the functions they provide, your own functions are not optimized. So the derivative that you calculate at each step would be faster in C.
To friends deciding between Mathematica and C++, I say go with Mathematica, since they should focus on getting results fast rather than building the fastest code.
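For a sense of what hand-writing the right-hand side in C involves, here is a minimal sketch. It is a fixed-step classical Runge-Kutta (RK4) integrator, not GSL and not what NDSolve does internally; the test problem y' = -y and the names rhs and rk4 are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Right-hand side of the example ODE y' = -y (hypothetical test problem). */
static double rhs(double t, double y)
{
    (void)t;            /* this RHS does not depend on t */
    return -y;
}

/* Integrate y' = f(t, y) from t0 using `steps` classical RK4 steps of size h. */
static double rk4(double (*f)(double, double), double t0, double y0,
                  double h, size_t steps)
{
    double y = y0;
    for (size_t i = 0; i < steps; ++i) {
        double t  = t0 + (double)i * h;
        double k1 = f(t, y);
        double k2 = f(t + 0.5 * h, y + 0.5 * h * k1);
        double k3 = f(t + 0.5 * h, y + 0.5 * h * k2);
        double k4 = f(t + h, y + h * k3);
        y += h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4);
    }
    return y;
}
```

Calling rk4(rhs, 0.0, 1.0, 0.01, 100) integrates to t = 1 and lands close to exp(-1). A production solver would add adaptive step sizes and error control, which is much of what NDSolve or GSL provide for free.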
Does __attribute__((always_inline)) force a function to be inlined by gcc?
Yes.
From documentation v4.1.2
From documentation latest
always_inline
Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
It should. I'm a big fan of manual inlining. Sure, used in excess it's a bad thing. But often times when optimizing code, there will be one or two functions that simply have to be inlined or performance goes down the toilet. And frankly, in my experience C compilers typically do not inline those functions when using the inline keyword.
I'm perfectly willing to let the compiler inline most of my code for me. It's only those half dozen or so absolutely vital cases that I really care about. People say "compilers do a good job at this." I'd like to see proof of that, please. So far, I've never seen a C compiler inline a vital piece of code I told it to without using some sort of forced inline syntax (__forceinline on MSVC, __attribute__((always_inline)) on GCC).
Yes, it will. That doesn't necessarily mean it's a good idea.
According to the gcc optimize options documentation, you can tune inlining with parameters:
-finline-limit=n
By default, GCC limits the size of functions that can be inlined. This flag
allows coarse control of this limit. n is the size of functions that can be
inlined in number of pseudo instructions.
Inlining is actually controlled by a number of parameters, which may be specified
individually by using --param name=value. The -finline-limit=n option sets some
of these parameters as follows:
max-inline-insns-single is set to n/2.
max-inline-insns-auto is set to n/2.
I suggest reading about all the inlining parameters in more detail and setting them appropriately.
I want to add here that I have a SIMD math library where inlining is absolutely critical for performance. Initially I set all functions to inline but the disassembly showed that even for the most trivial operators it would decide to actually call the function. Both MSVC and Clang showed this, with all optimization flags on.
I did as suggested in other posts in SO and added __forceinline for MSVC and __attribute__((always_inline)) for all other compilers. There was a consistent 25-35% improvement in performance in various tight loops with operations ranging from basic multiplies to sines.
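A minimal sketch of that per-compiler switch might look like the following. The macro name FORCE_INLINE and the helper functions are hypothetical; they are not taken from the library mentioned in this answer.

```c
/* Hypothetical macro name: picks the compiler-specific forced-inline spelling. */
#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline   /* fall back to a plain hint */
#endif

/* A small hot-path helper of the kind the answer describes. */
FORCE_INLINE float mul_add(float a, float b, float c)
{
    return a * b + c;
}

/* Caller: the loop body should contain the multiply-add, not a call. */
float dot3(const float *x, const float *y)
{
    float acc = 0.0f;
    for (int i = 0; i < 3; ++i)
        acc = mul_add(x[i], y[i], acc);
    return acc;
}
```

Checking the disassembly of dot3 (for example with gcc -S) is the reliable way to confirm that no call to mul_add remains.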
I didn't figure out why they had such a hard time inlining (perhaps templated code is harder?) but the bottom line is: there are very valid use cases for inlining manually and huge speedups to be gained.
If you're curious this is where I implemented it. https://github.com/redorav/hlslpp
Yes. It will inline the function regardless of any other options set. See here.
One can also use __always_inline. I have been using that for C++ member functions with GCC 4.8.1, but I could not find a good explanation in the GCC docs.
Actually the answer is "no". All it means is that the function is a candidate for inlining even with optimizations disabled.
I'm wondering about NVIDIA's cuBLAS library. Does anybody have experience with it? For example, if I write a C program using BLAS, will I be able to replace the calls to BLAS with calls to cuBLAS? Or, even better, implement a mechanism which lets the user choose at runtime?
What about if I use the BLAS Library provided by Boost with C++?
The answer by janneb is incorrect: cuBLAS is not a drop-in replacement for a CPU BLAS. It assumes data is already on the device, and the function signatures have an extra parameter to keep track of a cuBLAS context.
However, coming in CUDA 6.0 is a new library called NVBLAS which provides exactly this "drop-in" functionality. It intercepts Level 3 BLAS calls (GEMM, TRSM, etc.) and automatically sends them to the GPU, tiling the work so that PCIe transfers overlap with on-GPU computation.
There is some information here: https://developer.nvidia.com/cublasxt, and CUDA 6.0 is available to CUDA registered developers today.
Full docs will be online once CUDA 6.0 is released to the general public.
CUBLAS does not wrap around BLAS.
CUBLAS also accesses matrices in column-major order, as Fortran codes and BLAS do.
I am more used to writing code in C, even for CUDA.
Code written with CBLAS (which is a C wrapper of BLAS) can easily be changed into CUDA code.
Be aware that Fortran codes that use BLAS are quite different from C/C++ codes that use CBLAS.
Fortran and BLAS normally store matrices or double arrays in column-major order, but C/C++ code normally uses row-major order.
I normally handle this problem by storing the matrices in 1D arrays and using #define to write a macro that accesses element i,j of a matrix:
/* define macro to access Aij (1-based i,j) in the row-wise array A[M*N] */
#define indrow(ii,jj,N) (((ii)-1)*(N)+(jj)-1) /* does not depend on rows M */
/* define macro to access Aij (1-based i,j) in the col-wise array A[M*N] */
#define indcol(ii,jj,M) (((jj)-1)*(M)+(ii)-1)
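The same indexing can be written with small functions instead of macros. The sketch below is an illustration only: it uses 0-based indices, unlike the 1-based macros above, and the names idx_row, idx_col and row_to_col_major are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* 0-based index of element (i, j) in a row-major array with `ncols` columns. */
static size_t idx_row(size_t i, size_t j, size_t ncols) { return i * ncols + j; }

/* 0-based index of element (i, j) in a column-major array with `nrows` rows. */
static size_t idx_col(size_t i, size_t j, size_t nrows) { return j * nrows + i; }

/* Copy a row-major m x n matrix into column-major storage. */
static void row_to_col_major(const double *src, double *dst, size_t m, size_t n)
{
    assert(src != dst);   /* this is a copy, not an in-place transpose */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            dst[idx_col(i, j, m)] = src[idx_row(i, j, n)];
}
```

For the 2 x 3 matrix with rows (1, 2, 3) and (4, 5, 6), the row-major array is 1, 2, 3, 4, 5, 6 and the column-major array is 1, 4, 2, 5, 3, 6. Functions avoid the operator-precedence surprises that unparenthesized macro arguments can cause.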
The CBLAS library has well-organized parameters and conventions (const enum values) that tell each function the ordering of the matrix.
Beware that the storage of matrices also varies: a row-wise banded matrix is not stored the same way as a column-wise banded matrix.
I don't think there is a mechanism that lets the user choose between BLAS and CUBLAS without writing the code twice.
CUBLAS also has a "handle" argument on most function calls that does not appear in BLAS.
I thought of using #define to change the name at each function call, but this might not work.
I've been porting BLAS code to CUBLAS. The BLAS library I use is ATLAS, so what I say may be correct only up to choice of BLAS library.
The CBLAS interface that ATLAS provides requires you to specify whether you are using column-major or row-major ordering, and I chose column-major since I was using CLAPACK, which uses column-major ordering. LAPACKE, on the other hand, takes a layout argument and accepts either ordering. CUBLAS is column-major. You may need to adjust accordingly.
Even if ordering is not an issue, porting to CUBLAS was by no means a drop-in replacement. The largest issue is that you must move the data onto and off of the GPU's memory space. That memory is set up using cudaMalloc() and released with cudaFree(), which act as one might expect. You move data into GPU memory using cudaMemcpy(). The time this takes will be a large factor in determining whether it's worthwhile to move from CPU to GPU.
Once that's done, however, the calls are fairly similar. CblasNoTrans becomes CUBLAS_OP_N and CblasTrans becomes CUBLAS_OP_T. If your BLAS library (as ATLAS does) allows you to pass scalars by value, you will have to convert that to pass by reference (as is normal for FORTRAN).
Given this, any switch that allows a choice of CPU or GPU is most easily placed at a higher level than the function using BLAS. In my case, I have CPU and GPU variants of the algorithm and choose between them at a higher level depending on the size of the problem.
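One way to sketch such a higher-level switch in C is a function pointer chosen at runtime. Everything here is an assumption made for illustration: the names, the size threshold, and the GPU variant, which is only a stub standing in for code that would copy the operands with cudaMemcpy and call cublasDgemm.

```c
#include <stddef.h>

/* Common signature for both variants: C = A * B, n x n, column-major. */
typedef void (*matmul_fn)(size_t n, const double *a, const double *b, double *c);

/* CPU variant: naive triple loop standing in for a cblas_dgemm call. */
static void matmul_cpu(size_t n, const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (size_t k = 0; k < n; ++k)
                s += a[k * n + i] * b[j * n + k];
            c[j * n + i] = s;
        }
}

/* GPU variant placeholder: a real one would move data to the device, call
 * cuBLAS, and copy the result back. Here it defers to the CPU code so the
 * sketch stays self-contained. */
static void matmul_gpu_stub(size_t n, const double *a, const double *b, double *c)
{
    matmul_cpu(n, a, b, c);
}

/* Hypothetical threshold: below it, transfer cost is assumed to dominate. */
#define GPU_MIN_DIM 512

/* Pick a variant once, at a level above the BLAS calls themselves. */
static matmul_fn choose_matmul(size_t n, int gpu_available)
{
    return (gpu_available && n >= GPU_MIN_DIM) ? matmul_gpu_stub : matmul_cpu;
}
```

The caller writes choose_matmul(n, have_gpu)(n, a, b, c) and never touches handles or device pointers. The right threshold depends on the hardware and would have to be measured.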