Conflicting cblas API definitions between OpenBLAS and Eigen 3.3.4

I am new to Eigen and hope to use OpenBLAS as a backend for Eigen 3.3.4 on Android/ARMv7. Following the page below, I tried to use both in one application (build environment: Ubuntu 16.04 + Android NDK r15c):
http://eigen.tuxfamily.org/dox-devel/TopicUsingBlasLapack.html
gemm.cpp contains the following code:
#include <iostream>
#include <Eigen/Dense>
#include "cblas.h"
using namespace Eigen;
int main()
{
double A[6] = {1.0, 2.0, 1.0,-3.0, 4.0,-1.0};
double B[6] = {1.0, 2.0, 1.0,-3.0, 4.0,-1.0};
double C[9] = {.5 , .5 , .5 , .5 , .5 , .5 , .5 , .5 , .5};
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, 3, 3, 2, 1, A, 3, B, 3, 2, C, 3);
return 0;
}
My Android.mk looks like this,
LOCAL_PATH := $(call my-dir)
#build a test executable
include $(CLEAR_VARS)
LOCAL_MODULE := gemm
LOCAL_C_INCLUDES := /home/yangfan/workspace/study/eigen-3.3.4
LOCAL_C_INCLUDES += /home/yangfan/workspace/study/openBLAS
LOCAL_SRC_FILES := $(LOCAL_PATH)/gemm.cpp
LOCAL_CFLAGS += -DEIGEN_USE_BLAS
LOCAL_CFLAGS += -fPIC -frtti -fexceptions -lz -O3
LOCAL_LDLIBS += -lm -llog -lz
LOCAL_LDLIBS += $(LOCAL_PATH)/openblas-libs/libopenblas.a
include $(BUILD_EXECUTABLE)
When I try to compile the project, I get errors like the following (I picked one to paste here):
In file included from ././gemm.cpp:4:
In file included from /home/yangfan/workspace/study/openBLAS/cblas.h:5:
In file included from /home/yangfan/workspace/study/openBLAS/common.h:751:
/home/yangfan/workspace/study/openBLAS/common_interface.h:105:9: error: functions that differ only in their return type cannot be overloaded
void BLASFUNC(dcopy) (blasint *, double *, blasint *, double *, blasint *);
/home/yangfan/workspace/study/eigen-3.3.4/Eigen/src/Core/util/../../misc/blas.h:44:8: note: previous declaration is here
int BLASFUNC(dcopy) (int *, double *, int *, double *, int *);
There are different return types for the same blas functions in openblas and eigen.
Q1. Why are there different return types for the same blas APIs in OpenBLAS and Eigen?
Q2. Am I missing something? I would appreciate some guidance on using OpenBLAS as a backend for Eigen.
Q3. Which version is higher, 3.3.4 or 3.3.90? ^-^
thanks so much for your help.

There seem to be two interface styles: Fortran BLAS and CBLAS. Eigen only uses Fortran-style BLAS calls, and developers do not need to provide a header file. I should remove the Eigen/Dense include if I only use cblas functions.
But I am still puzzled by the issue: why do Eigen and OpenBLAS declare different return types for the Fortran-style functions?
in common_interface.h of OpenBLAS,
void BLASFUNC(dgemm)(char *, char *, blasint *, blasint *, blasint *, double *,
double *, blasint *, double *, blasint *, double *, double *, blasint *);
in misc/blas.h of Eigen,
int BLASFUNC(dgemm)(const char *, const char *, const int *, const int *, const int *, const double *,
const double *, const int *, const double *, const int *, const double *, double *, const int *);
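To make the Fortran-style convention concrete, here is a minimal sketch of a reference routine that takes its arguments in the same order as the dgemm declarations above (everything passed by pointer, matrices in column-major order). The name ref_dgemm is hypothetical and belongs to neither OpenBLAS nor Eigen; it returns void, matching the OpenBLAS declaration, and it is only meant for checking small results, not for performance.

```cpp
#include <cassert>

// Hypothetical reference routine (not part of OpenBLAS or Eigen).
// Computes C = alpha * op(A) * op(B) + beta * C for column-major matrices,
// where op(X) is X or its transpose depending on the trans flag.
void ref_dgemm(const char* transa, const char* transb,
               const int* m, const int* n, const int* k,
               const double* alpha, const double* a, const int* lda,
               const double* b, const int* ldb,
               const double* beta, double* c, const int* ldc) {
    const bool ta = (*transa == 'T' || *transa == 't');
    const bool tb = (*transb == 'T' || *transb == 't');
    assert(*ldc >= *m);  // leading dimension must cover a full column of C
    for (int j = 0; j < *n; ++j) {
        for (int i = 0; i < *m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < *k; ++p) {
                // Element (i, p) of op(A) and element (p, j) of op(B).
                const double aip = ta ? a[p + i * (*lda)] : a[i + p * (*lda)];
                const double bpj = tb ? b[j + p * (*ldb)] : b[p + j * (*ldb)];
                sum += aip * bpj;
            }
            c[i + j * (*ldc)] = (*alpha) * sum + (*beta) * c[i + j * (*ldc)];
        }
    }
}
```

Called with "N", "T", m = n = 3, k = 2, alpha = 1, beta = 2 and the arrays from gemm.cpp, this fills C with the column-major values 11, -9, 5, -9, 21, -1, 5, -1, 3.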

Related

Intel C++ compiler failing to select template function overload

The following C++11 code compiles correctly with g++ 6.3.0 and results in the behavior that I consider correct (namely, the first function is picked). However, with Intel's C++ compiler (icc 17.0.4) it fails to compile; the compiler reports that multiple possible function overloads exist.
#include <iostream>
#include <type_traits>
template<typename R, typename ... Args>
static void f(R(func)(const int&, Args...)) {
std::cout << "In first version of f" << std::endl;
}
template<typename R, typename ... Args, typename X = typename std::is_void<R>::type>
static void f(R(func)(Args...), X x = X()) {
std::cout << "In second version of f" << std::endl;
}
double h(const int& x, double y) {
return 0;
}
int main(int argc, char** argv) {
f(h);
return 0;
}
Here is the error reported by icc:
test.cpp(18): error: more than one instance of overloaded function "f" matches the argument list:
function template "void f(R (*)(const int &, Args...))"
function template "void f(R (*)(Args...), X)"
argument types are: (double (const int &, double))
f(h);
^
So my two questions are: which compiler is correct with respect to the standard, and how would you modify this code so that it compiles? (Note that f is a user-facing API and I would like to avoid modifying its prototype.)
Note that if I remove typename X = typename std::is_void<R>::type and the X x = X() argument in the second version of f, icc compiles it fine.
This is a bug in Intel Compiler 17.0 Update 4. The behavior of GCC is right in this case. This issue is resolved in Intel Compiler 18.0 Update 4 and above.
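For compilers affected by this bug, one possible workaround is to avoid relying on partial ordering between the two templates and to dispatch on a trait instead. The sketch below is an assumption, not something from the answer above: the names first_is_const_int_ref and f_impl are hypothetical, the defaulted X parameter of the second overload is dropped, and f returns an int (1 or 2) instead of printing so that the chosen branch can be checked.

```cpp
#include <type_traits>

// Hypothetical trait: true when a function type takes `const int&` first.
template <typename F>
struct first_is_const_int_ref : std::false_type {};

template <typename R, typename... Args>
struct first_is_const_int_ref<R(const int&, Args...)> : std::true_type {};

// Branch corresponding to the "first version of f" in the question.
template <typename R, typename... Args>
int f_impl(R (*)(Args...), std::true_type) { return 1; }

// Branch corresponding to the "second version of f" in the question.
template <typename R, typename... Args>
int f_impl(R (*)(Args...), std::false_type) { return 2; }

// Single user-facing entry point, so overload resolution never has to
// rank two competing function templates named f.
template <typename R, typename... Args>
int f(R (*func)(Args...)) {
    return f_impl(func, first_is_const_int_ref<R(Args...)>{});
}

double h(const int&, double) { return 0; }
double g(double) { return 0; }
```

The call site stays f(h), but the prototype of f does change slightly, so this only helps if callers never pass the second argument explicitly.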

cmake finds cuda but fails to find cuda libraries on Windows

I have a small cmake project that works perfectly well on Linux but fails on Windows 10 (I tried with two different computers) with the latest versions of cmake and CUDA 8. It finds CUDA just fine, but fails to find the libraries. My cmake file:
cmake_minimum_required(VERSION 3.0)
project(myproject)
find_package(CUDA REQUIRED)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject ${CUDA_CUBLAS_LIBRARIES})
message(STATUS "")
message(STATUS "FoundCUDA : ${CUDA_FOUND}")
message(STATUS "Cuda cublas libraries : ${CUDA_CUBLAS_LIBRARIES}")
In the same folder, I have the header matrix_mm.cuh:
#include <cstdlib>
namespace myproject {
float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n);
} /* end namespace myproject */
and matrix_mm.cu:
#include <cublas_v2.h>
#include "matrix_mm.cuh"
namespace myproject {
// Adapted from https://solarianprogrammer.com/2012/05/31/matrix-multiplication-cuda-cublas-curand-thrust/
void gpu_blas_mmul(const float *a, const float *b, float *c, const size_t m, const size_t k, const size_t n) {
int lda = m, ldb = k, ldc = m;
const float alf = 1;
const float bet = 0;
const float *alpha = &alf;
const float *beta = &bet;
// Create a handle for CUBLAS
cublasHandle_t handle;
cublasCreate(&handle);
// Do the actual multiplication
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
// Destroy the handle
cublasDestroy(handle);
}
float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n) {
size_t const a_bytes = m * k * sizeof(float);
size_t const b_bytes = k * n * sizeof(float);
size_t const c_bytes = m * n * sizeof(float);
float* c = (float*)std::malloc(c_bytes);
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, a_bytes);
cudaMalloc(&d_B, b_bytes);
cudaMalloc(&d_C, c_bytes);
cudaMemcpy(d_A, a, a_bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, b, b_bytes, cudaMemcpyHostToDevice);
gpu_blas_mmul(d_A, d_B, d_C, m, k, n);
cudaMemcpy(c, d_C, c_bytes, cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
return c;
}
} /* end namespace myproject */
On Linux I get:
-- FoundCUDA : TRUE
-- Toolkit root : /usr
-- Cuda cublas libraries : /usr/lib/x86_64-linux-gnu/libcublas.so
While on both Windows 10 machines I get
-- FoundCUDA : TRUE
-- Toolkit root : C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0
-- Cuda cublas libraries : CUDA_cublas_LIBRARY-NOTFOUND;CUDA_cublas_device_LIBRARY-NOTFOUND
...and of course it fails to compile because the linker can't find cublas.
I tried quite a few things: making the lib SHARED instead of STATIC, making sure CUDA was in the Windows environment variables, etc., but nothing works.
This is the CMake snippet I use to find CUDA 8 on Windows 10 with CMake 3.7.1:
cmake_minimum_required(VERSION 3.7)
project(myproject)
# Check for CUDA ENV vars
IF(NOT DEFINED ENV{CUDA_PATH})
MESSAGE(FATAL_ERROR "CUDA_PATH Environment variable is not set.")
ENDIF(NOT DEFINED ENV{CUDA_PATH})
# Set the toolkit path
FILE(TO_CMAKE_PATH "$ENV{CUDA_PATH}" CUDA_TOOLKIT_ROOT_DIR)
SET(CUDA_TOOLKIT_ROOT_DIR ${CUDA_TOOLKIT_ROOT_DIR} CACHE STRING "Root directory of the Cuda Library" FORCE)
# Find the package
find_package(CUDA REQUIRED)
# Create an interface library as a link target (Requires CMake 3.7.0+)
add_library(cuda INTERFACE)
set_target_properties(cuda PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES ${CUDA_INCLUDE_DIRS}
INTERFACE_LINK_LIBRARIES "${CUDA_LIBRARIES};${CUDA_CUFFT_LIBRARIES};${CUDA_CUBLAS_LIBRARIES}"
)
SET(CUDA_HOST_COMPILATION_CPP ON)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject cuda)
I think the quotes around the libraries are important since the Windows path will contain spaces.
I would also make sure that you delete your cache and regenerate the project. It's often the cause of errors when variable values appear correct on the surface (or when you make changes to a non-FORCE cache variable).

GCC Vector Extensions Sqrt

I am currently experimenting with the GCC vector extensions. However, I am wondering how to go about getting sqrt(vec) to work as expected.
As in:
typedef double v4d __attribute__ ((vector_size (16)));
v4d myfunc(v4d in)
{
return some_sqrt(in);
}
and, at least on a recent x86 system, have it emit the relevant instruction, sqrtpd. Is there a GCC builtin for sqrt that works on vector types, or does one need to drop down to the intrinsic level to accomplish this?
Looks like it's a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54408 I don't know of any workaround other than doing it component-wise. The vector extensions were never meant to replace platform-specific intrinsics anyway.
Some funky code to this effect:
#include <cmath>
#include <utility>
template <::std::size_t...> struct indices { };
template <::std::size_t M, ::std::size_t... Is>
struct make_indices : make_indices<M - 1, M - 1, Is...> {};
template <::std::size_t... Is>
struct make_indices<0, Is...> : indices<Is...> {};
typedef float vec_type __attribute__ ((vector_size(4 * sizeof(float))));
template <::std::size_t ...Is>
vec_type sqrt_(vec_type const& v, indices<Is...> const)
{
vec_type r;
::std::initializer_list<int>{(r[Is] = ::std::sqrt(v[Is]), 0)...};
return r;
}
vec_type sqrt(vec_type const& v)
{
return sqrt_(v, make_indices<4>());
}
int main()
{
vec_type v;
return sqrt(v)[0];
}
You could also try your luck with auto-vectorization, which is separate from the vector extension.
You can loop over the vectors directly
#include <math.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d myfunc(v2d in) {
v2d out;
for(int i=0; i<2; i++) out[i] = sqrt(in[i]);
return out;
}
The sqrt function has to set errno for negative inputs, which gets in the way of vectorization, but if you avoid this with -Ofast (which implies -fno-math-errno) both Clang and GCC produce simply sqrtpd.
https://godbolt.org/g/aCuovX
GCC might have a bug because I had to loop to 4 even though there are only 2 elements to get optimal code.
But with AVX and AVX512 GCC and Clang are ideal
AVX
https://godbolt.org/g/qdTxyp
AVX512
https://godbolt.org/g/MJP1n7
My reading of the question is that you want the square root of 4 packed double precision values... that's 32 bytes. Use the appropriate AVX intrinsic:
#include <x86intrin.h>
typedef double v4d __attribute__ ((vector_size (32)));
v4d myfunc (v4d v) {
return _mm256_sqrt_pd(v);
}
With x86-64 gcc 10.2 and x86-64 clang 10.0.1, using -O3 -march=skylake:
myfunc:
vsqrtpd %ymm0, %ymm0 # (or just `ymm0` for Intel syntax)
ret
ymm0 is the return value register.
That said, it just so happens there is a builtin: __builtin_ia32_sqrtpd256, which doesn't require the intrinsics header. I would definitely discourage its use however.
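The size arithmetic behind the 16-byte versus 32-byte point can be checked directly, since vector_size counts bytes rather than elements. The following is a minimal sketch, assuming GCC or Clang vector extensions; lanewise_sqrt is a hypothetical helper that derives the lane count from the type and applies sqrt component-wise.

```cpp
#include <cmath>
#include <cstddef>

typedef double v2d __attribute__((vector_size(16)));  // two doubles
typedef double v4d __attribute__((vector_size(32)));  // four doubles

static_assert(sizeof(v2d) / sizeof(double) == 2, "16 bytes hold two doubles");
static_assert(sizeof(v4d) / sizeof(double) == 4, "four doubles need 32 bytes");

// Hypothetical helper: component-wise square root for any such vector type.
template <typename V>
V lanewise_sqrt(V in) {
    constexpr std::size_t lanes = sizeof(V) / sizeof(in[0]);
    V out = in;
    for (std::size_t i = 0; i < lanes; ++i) {
        out[i] = std::sqrt(in[i]);
    }
    return out;
}
```

Whether the loop collapses to a single sqrtpd or vsqrtpd depends on the compiler and flags, as discussed above.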

Compiling GSL odeiv2 with g++

I'm attempting to compile the example code relating to the ODE solver, gsl/gsl_odeiv2, using g++. The code below is from their website :
http://www.gnu.org/software/gsl/manual/html_node/ODE-Example-programs.html
and compiles fine under gcc, but g++ throws the error
invalid conversion from 'void*' to 'int (*)(double, const double*, double*, double*,
void*)' [-fpermissive]
in the code :
#include <stdio.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_odeiv2.h>
int func (double t, const double y[], double f[], void *params)
{
double mu = *(double *)params;
f[0] = y[1];
f[1] = -y[0] - mu*y[1]*(y[0]*y[0] - 1);
return GSL_SUCCESS;
}
int * jac;
int main ()
{
double mu = 10;
gsl_odeiv2_system sys = {func, jac, 2, &mu};
gsl_odeiv2_driver * d = gsl_odeiv2_driver_alloc_y_new (&sys, gsl_odeiv2_step_rkf45, 1e-6, 1e-6, 0.0);
int i;
double t = 0.0, t1 = 100.0;
double y[2] = { 1.0, 0.0 };
for (i = 1; i <= 100; i++)
{
double ti = i * t1 / 100.0;
int status = gsl_odeiv2_driver_apply (d, &t, ti, y);
if (status != GSL_SUCCESS)
{
printf ("error, return value=%d\n", status);
break;
}
printf ("%.5e %.5e %.5e\n", t, y[0], y[1]);
}
gsl_odeiv2_driver_free (d);
return 0;
}
The error is given on the line
gsl_odeiv2_system sys = {func, jac, 2, &mu};
Any help in solving this issue would be fantastic. I'm hoping to include some stdlib elements, hence wanting to compile it as C++. Also, if I can get it to compile with g++-4.7, I could more easily multithread it using C++11's additions to the language. Thank you very much.
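It looks like the problem is with the Jacobian: jac is declared as an int *, not as a function, and C++ will not implicitly convert it to the function pointer the struct expects. In your particular case you could just use NULL instead of jac in the definition of your system, i.e.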
gsl_odeiv2_system sys = {func, NULL, 2, &mu};
In general, your Jacobian must be a function with a particular signature (see the GSL manual); that is why your compiler is complaining.
Also, you may want to link the gsl library manually:
-L/usr/local/lib -lgsl
if you are on a linux system.
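If a Jacobian is needed (rkf45 does not use one, but some other steppers do), it has to be a function rather than a data pointer. Below is a minimal sketch of what that could look like for the system in the question. To keep it compilable without GSL, ode_system is a hypothetical stand-in whose members mirror gsl_odeiv2_system, and the Jacobian is written into a plain row-major array instead of going through gsl_matrix; make_system is likewise a hypothetical helper.

```cpp
#include <cstddef>

// Hypothetical stand-in for gsl_odeiv2_system (same member order).
struct ode_system {
    int (*function)(double t, const double y[], double dydt[], void* params);
    int (*jacobian)(double t, const double y[], double* dfdy, double dfdt[],
                    void* params);
    std::size_t dimension;
    void* params;
};

int func(double t, const double y[], double f[], void* params) {
    (void)t;  // the system is autonomous
    const double mu = *static_cast<double*>(params);
    f[0] = y[1];
    f[1] = -y[0] - mu * y[1] * (y[0] * y[0] - 1);
    return 0;  // GSL_SUCCESS is 0
}

int jac(double t, const double y[], double* dfdy, double dfdt[], void* params) {
    (void)t;
    const double mu = *static_cast<double*>(params);
    // Row-major 2x2 matrix of partial derivatives df_i/dy_j.
    dfdy[0] = 0.0;
    dfdy[1] = 1.0;
    dfdy[2] = -2.0 * mu * y[0] * y[1] - 1.0;
    dfdy[3] = -mu * (y[0] * y[0] - 1.0);
    // No explicit time dependence.
    dfdt[0] = 0.0;
    dfdt[1] = 0.0;
    return 0;
}

// Both members are now real function pointers, so g++ accepts the initializer.
ode_system make_system(double* mu) {
    ode_system sys = {func, jac, 2, mu};
    return sys;
}
```

With the real library, the same initializer shape applies to gsl_odeiv2_system, with the Jacobian body adapted to GSL's matrix helpers.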

SSE (SIMD extensions) support in gcc

I came across the code below:
#include "stdio.h"
#define VECTOR_SIZE 4
typedef float v4sf __attribute__ ((vector_size(sizeof(float)*VECTOR_SIZE)));
// vector of four single floats
typedef union f4vector
{
v4sf v;
float f[VECTOR_SIZE];
} f4vector;
void print_vector (f4vector *v)
{
printf("%f,%f,%f,%f\n", v->f[0], v->f[1], v->f[2], v->f[3]);
}
int main()
{
union f4vector a, b, c;
a.v = (v4sf){1.2, 2.3, 3.4, 4.5};
b.v = (v4sf){5., 6., 7., 8.};
c.v = a.v + b.v;
print_vector(&a);
print_vector(&b);
print_vector(&c);
}
This code builds fine and works as expected using gcc (with its built-in SSE / MMX extensions and vector data types). The code performs a SIMD vector addition on four single-precision floats.
I want to understand in detail what each keyword/function call on this typedef line means and does:
typedef float v4sf __attribute__ ((vector_size(sizeof(float)*VECTOR_SIZE)));
What does the vector_size() function return?
What is the __attribute__ keyword for?
Is the float data type being typedef'd to the v4sf type here?
I understand the rest.
thanks,
-AD
__attribute__ is GCC's way of exposing functionality from the compiler that isn't in the C or C++ standards.
__attribute__((vector_size(x))) instructs GCC to treat the type as a vector of x bytes. For SSE this is 16 bytes.
However, I would suggest using the __m128, __m128i or __m128d types found in the various <*mmintrin.h> headers. They are more portable across compilers.
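To illustrate that the vector_size argument is a byte count, here is a minimal sketch assuming GCC or Clang vector extensions and a 4-byte float. The helper names add4 and lane are hypothetical; they show that the vector type can be added and subscripted directly, without the union from the question.

```cpp
#define VECTOR_SIZE 4
typedef float v4sf __attribute__((vector_size(sizeof(float) * VECTOR_SIZE)));

// sizeof(float) * VECTOR_SIZE evaluates to 16, so v4sf is a 16-byte vector.
static_assert(sizeof(v4sf) == 16, "four 4-byte floats occupy 16 bytes");

// Element-wise addition of all four lanes.
v4sf add4(v4sf a, v4sf b) { return a + b; }

// Read a single lane by subscripting the vector type.
float lane(v4sf v, int i) { return v[i]; }
```

So v4sf is a new name for a 16-byte vector of floats, not an alias for plain float.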

Resources