SSE2: _mm_mul_ps fails on OS X in case of GCC 4.2 and O0 optimization - debugging

I am trying to calculate the squared Euclidean distance between two 4D float vectors using SSE2. My OS is Mac OS X 10.7 Lion.
When I use the Apple LLVM compiler in Xcode 4.5.2, everything is fine. But when I switch to GCC 4.2 in the project's settings, I get an EXC_BAD_ACCESS error at the _mm_mul_ps operation.
When I compile the code from the command line (g++ main.cpp) without additional arguments, I get "Segmentation fault". But when I enable any optimization level other than O0 (O1, O2, O3, Os), everything works.
I cannot reproduce this issue on my Ubuntu 12.04 machine with GCC 4.6.3.
#include <stdio.h>
#include <emmintrin.h>

typedef float SPPixel[4];

float sp_squared_color_diff(const SPPixel px1, const SPPixel px2) {
    SPPixel d;
    __m128 sse_px1 = _mm_load_ps(px1);
    __m128 sse_px2 = _mm_load_ps(px2);
    sse_px1 = _mm_sub_ps(sse_px1, sse_px2);
    sse_px2 = _mm_mul_ps(sse_px1, sse_px1); // EXC_BAD_ACCESS
    _mm_store_ps(d, sse_px2);
    return d[0] + d[1] + d[2] + d[3];
}

int main(int argc, const char * argv[]) {
    SPPixel a __attribute__ ((aligned (16))) = {1, 2, 3, 4};
    SPPixel b __attribute__ ((aligned (16))) = {2, 4, 6, 8};
    float result = sp_squared_color_diff(a, b);
    printf("result = %f\n", result);
    return 0;
}

The local variable d is misaligned. Fix the alignment in the typedef for SPPixel rather than having to remember it on every definition.
Change:
typedef float SPPixel[4];
to:
typedef float SPPixel[4] __attribute__ ((aligned(16)));
and then you can also remove the __attribute__ ((aligned(16))) qualifiers in main.
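For reference, here is a minimal sketch of the function with that typedef applied. It assumes GCC or Clang (the attribute syntax is compiler specific), takes the pixels as plain float pointers, and adds a scalar fallback for targets without SSE2 so that the snippet builds elsewhere. The helper is_aligned16 is a hypothetical name introduced only to make the alignment visible.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// The alignment is part of the type, so every SPPixel object,
// including locals such as d below, is 16-byte aligned.
typedef float SPPixel[4] __attribute__((aligned(16)));

// Hypothetical helper: true if p sits on a 16-byte boundary.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Both pointers must be 16-byte aligned on the SSE2 path, because
// _mm_load_ps and _mm_store_ps require aligned addresses.
float sp_squared_color_diff(const float* px1, const float* px2) {
    SPPixel d;
#if defined(__SSE2__)
    __m128 a = _mm_load_ps(px1);
    __m128 b = _mm_load_ps(px2);
    __m128 diff = _mm_sub_ps(a, b);
    _mm_store_ps(d, _mm_mul_ps(diff, diff));
#else
    // Scalar fallback for targets without SSE2.
    for (int i = 0; i < 4; ++i) {
        float t = px1[i] - px2[i];
        d[i] = t * t;
    }
#endif
    return d[0] + d[1] + d[2] + d[3];
}
```

If the buffers cannot be guaranteed to be aligned, _mm_loadu_ps and _mm_storeu_ps are the unaligned counterparts of the load and store used here.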

cmake finds cuda but fails to find cuda libraries on Windows

I have a small cmake project that works perfectly well on Linux but fails on Windows 10 (I tried with two different computers) with the latest versions of cmake and CUDA 8. It finds CUDA just fine, but fails to find the libraries. My cmake file:
cmake_minimum_required(VERSION 3.0)
project(myproject)
find_package(CUDA REQUIRED)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject ${CUDA_CUBLAS_LIBRARIES})
message(STATUS "")
message(STATUS "FoundCUDA : ${CUDA_FOUND}")
message(STATUS "Cuda cublas libraries : ${CUDA_CUBLAS_LIBRARIES}")
In the same folder, I have the header matrix_mm.cuh:
#include <cstdlib>
namespace myproject {
float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n);
} /* end namespace myproject */
and matrix_mm.cu:
#include <cublas_v2.h>
#include "matrix_mm.cuh"

namespace myproject {

// Adapted from https://solarianprogrammer.com/2012/05/31/matrix-multiplication-cuda-cublas-curand-thrust/
void gpu_blas_mmul(const float *a, const float *b, float *c, const size_t m, const size_t k, const size_t n) {
    int lda = m, ldb = k, ldc = m;
    const float alf = 1;
    const float bet = 0;
    const float *alpha = &alf;
    const float *beta = &bet;
    // Create a handle for CUBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);
    // Do the actual multiplication
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    // Destroy the handle
    cublasDestroy(handle);
}

float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n) {
    size_t const a_bytes = m * k * sizeof(float);
    size_t const b_bytes = k * n * sizeof(float);
    size_t const c_bytes = m * n * sizeof(float);
    float* c = (float*)std::malloc(c_bytes);
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, a_bytes);
    cudaMalloc(&d_B, b_bytes);
    cudaMalloc(&d_C, c_bytes);
    cudaMemcpy(d_A, a, a_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, b, b_bytes, cudaMemcpyHostToDevice);
    gpu_blas_mmul(d_A, d_B, d_C, m, k, n);
    cudaMemcpy(c, d_C, c_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return c;
}

} /* end namespace myproject */
On Linux I get:
-- FoundCUDA : TRUE
-- Toolkit root : /usr
-- Cuda cublas libraries : /usr/lib/x86_64-linux-gnu/libcublas.so
While on both Windows 10 machines I get
-- FoundCUDA : TRUE
-- Toolkit root : C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0
-- Cuda cublas libraries : CUDA_cublas_LIBRARY-NOTFOUND;CUDA_cublas_device_LIBRARY-NOTFOUND
...and of course it fails to compile because the linker can't find cublas.
I tried quite a few things: making the lib SHARED instead of STATIC, making sure CUDA was in Windows' environment variables, etc., but nothing works.
This is the CMake snippet I use to find CUDA 8 on Windows 10 with CMake 3.7.1:
cmake_minimum_required(VERSION 3.7)
project(myproject)

# Check for CUDA ENV vars
IF(NOT DEFINED ENV{CUDA_PATH})
    MESSAGE(FATAL_ERROR "CUDA_PATH Environment variable is not set.")
ENDIF(NOT DEFINED ENV{CUDA_PATH})

# Set the toolkit path
FILE(TO_CMAKE_PATH "$ENV{CUDA_PATH}" CUDA_TOOLKIT_ROOT_DIR)
SET(CUDA_TOOLKIT_ROOT_DIR ${CUDA_TOOLKIT_ROOT_DIR} CACHE STRING "Root directory of the Cuda Library" FORCE)

# Find the package
find_package(CUDA REQUIRED)

# Create an interface library as a link target (Requires CMake 3.7.0+)
add_library(cuda INTERFACE)
set_target_properties(cuda PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES ${CUDA_INCLUDE_DIRS}
    INTERFACE_LINK_LIBRARIES "${CUDA_LIBRARIES};${CUDA_CUFFT_LIBRARIES};${CUDA_CUBLAS_LIBRARIES}"
)

SET(CUDA_HOST_COMPILATION_CPP ON)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject cuda)
I think the quotes around the libraries are important since the Windows path will contain spaces.
I would also make sure that you delete your CMake cache and regenerate the project. A stale cache is often the cause of errors when variable values appear correct on the surface (or when you make changes to a non-FORCE cache variable).

Mixing Scalar Types in Eigen

#include <iostream>
#include <Eigen/Core>

namespace Eigen {
// float op double -> double
template <typename BinaryOp>
struct ScalarBinaryOpTraits<float, double, BinaryOp> {
    enum { Defined = 1 };
    typedef double ReturnType;
};
// double op float -> double
template <typename BinaryOp>
struct ScalarBinaryOpTraits<double, float, BinaryOp> {
    enum { Defined = 1 };
    typedef double ReturnType;
};
}

int main() {
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic> m1(2, 2);
    m1 << 1, 2, 3, 4;
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> m2(2, 2);
    m2 << 1, 2, 3, 4;
    std::cerr << m1 * m2 << std::endl; // <- boom!!
}
I'd like to know why the above code does not compile. Here are the full error messages. Please note that if I define m1 and m2 to have fixed sizes, it works fine.
I'm using Eigen 3.3.1. It's tested on a Mac running OS X 10.12 with Apple's clang-800.0.42.1.
This is because the general matrix-matrix product is highly optimized with aggressive manual vectorization, pipelining, multi-level caching, etc. This part does not support mixing float and double. You can bypass this heavily optimized implementation with m1.lazyProduct(m2), which corresponds to the implementation used for small fixed-size matrices, but there are only disadvantages to doing so: the ALUs do not support mixing float and double, so float values have to be promoted to double anyway, and you will lose vectorization. It is better to cast the float matrix to double explicitly:
m1.cast<double>() * m2

GCC Vector Extensions Sqrt

I am currently experimenting with the GCC vector extensions. However, I am wondering how to go about getting sqrt(vec) to work as expected.
As in:
typedef double v4d __attribute__ ((vector_size (16)));
v4d myfunc(v4d in)
{
return some_sqrt(in);
}
and at least on a recent x86 system have it emit a call to the relevant intrinsic sqrtpd. Is there a GCC builtin for sqrt that works on vector types or does one need to drop down to the intrinsic level to accomplish this?
Looks like it's a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54408. I don't know of any workaround other than doing it component-wise. The vector extensions were never meant to replace platform-specific intrinsics anyway.
Some funky code to this effect:
#include <cmath>
#include <cstddef>
#include <initializer_list>
#include <utility>

// Compile-time index sequence 0, 1, ..., M-1
template <::std::size_t...> struct indices { };
template <::std::size_t M, ::std::size_t... Is>
struct make_indices : make_indices<M - 1, M - 1, Is...> {};
template <::std::size_t... Is>
struct make_indices<0, Is...> : indices<Is...> {};

typedef float vec_type __attribute__ ((vector_size(4 * sizeof(float))));

// Expands to one scalar sqrt per lane
template <::std::size_t ...Is>
vec_type sqrt_(vec_type const& v, indices<Is...> const)
{
    vec_type r;
    ::std::initializer_list<int>{(r[Is] = ::std::sqrt(v[Is]), 0)...};
    return r;
}

vec_type sqrt(vec_type const& v)
{
    return sqrt_(v, make_indices<4>());
}

int main()
{
    vec_type v = {4.0f, 9.0f, 16.0f, 25.0f}; // initialized so the result is well defined
    return sqrt(v)[0];
}
You could also try your luck with auto-vectorization, which is separate from the vector extension.
You can loop over the vectors directly:
#include <math.h>

typedef double v2d __attribute__ ((vector_size (16)));

v2d myfunc(v2d in) {
    v2d out;
    for(int i=0; i<2; i++) out[i] = sqrt(in[i]);
    return out;
}
By default, the sqrt function has to report invalid (negative) inputs through errno, which can keep the compiler from emitting a bare vector square root. If you drop that requirement with -Ofast (which implies -fno-math-errno), both Clang and GCC produce simply sqrtpd:
https://godbolt.org/g/aCuovX
GCC might have a bug, because I had to loop to 4 even though there are only 2 elements to get optimal code.
But with AVX and AVX512, GCC and Clang are ideal:
AVX: https://godbolt.org/g/qdTxyp
AVX512: https://godbolt.org/g/MJP1n7
My reading of the question is that you want the square root of 4 packed double precision values... that's 32 bytes. Use the appropriate AVX intrinsic:
#include <x86intrin.h>

typedef double v4d __attribute__ ((vector_size (32)));

v4d myfunc (v4d v) {
    return _mm256_sqrt_pd(v);
}
With x86-64 gcc 10.2 and x86-64 clang 10.0.1, using -O3 -march=skylake:
myfunc:
    vsqrtpd %ymm0, %ymm0 # (or just `ymm0` for Intel syntax)
    ret
ymm0 is the return value register.
That said, it just so happens there is a builtin: __builtin_ia32_sqrtpd256, which doesn't require the intrinsics header. I would definitely discourage its use however.

Compiling GSL odeiv2 with g++

I'm attempting to compile the example code relating to the ODE solver, gsl/gsl_odeiv2, using g++. The code below is from their website :
http://www.gnu.org/software/gsl/manual/html_node/ODE-Example-programs.html
and compiles fine under gcc, but g++ throws the error
invalid conversion from 'void*' to 'int (*)(double, const double*, double*, double*, void*)' [-fpermissive]
in the code:
#include <stdio.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_odeiv2.h>

int func (double t, const double y[], double f[], void *params)
{
    double mu = *(double *)params;
    f[0] = y[1];
    f[1] = -y[0] - mu*y[1]*(y[0]*y[0] - 1);
    return GSL_SUCCESS;
}

int * jac;

int main ()
{
    double mu = 10;
    gsl_odeiv2_system sys = {func, jac, 2, &mu};
    gsl_odeiv2_driver * d = gsl_odeiv2_driver_alloc_y_new (&sys, gsl_odeiv2_step_rkf45, 1e-6, 1e-6, 0.0);
    int i;
    double t = 0.0, t1 = 100.0;
    double y[2] = { 1.0, 0.0 };
    for (i = 1; i <= 100; i++)
    {
        double ti = i * t1 / 100.0;
        int status = gsl_odeiv2_driver_apply (d, &t, ti, y);
        if (status != GSL_SUCCESS)
        {
            printf ("error, return value=%d\n", status);
            break;
        }
        printf ("%.5e %.5e %.5e\n", t, y[0], y[1]);
    }
    gsl_odeiv2_driver_free (d);
    return 0;
}
The error is given on the line
gsl_odeiv2_system sys = {func, jac, 2, &mu};
Any help in solving this issue would be fantastic. I'm hoping to include some stdlib elements, hence wanting to compile it as C++. Also, if I can get it to compile with g++-4.7, I could more easily multithread it using C++11's additions to the language. Thank you very much.
It looks like you have a problem with the Jacobian. In your particular case you could just use NULL instead of jac in the definition of your system, i.e.
gsl_odeiv2_system sys = {func, NULL, 2, &mu};
In general, your Jacobian must be a function with a particular signature (see the GSL manual); that is why your compiler is complaining.
Also, if you are on a Linux system, you may want to link the GSL library manually:
-L/usr/local/lib -lgsl
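To make the "particular signature" concrete, here is a hedged sketch of a Jacobian for the system in the question, written without the GSL headers so it compiles on its own. The parameter list is the one g++ prints in the error message. The alias jac_fn is a hypothetical name used only to check the type, and the plain return value 0 stands in for GSL_SUCCESS. The matrix is stored row-major, with dfdy[i * 2 + j] holding the derivative of f[i] with respect to y[j].

```cpp
// Function-pointer type taken from the g++ error message.
// jac_fn is a hypothetical alias, not a GSL name.
typedef int (*jac_fn)(double, const double*, double*, double*, void*);

// Jacobian of f[0] = y[1], f[1] = -y[0] - mu*y[1]*(y[0]*y[0] - 1).
int jac(double t, const double y[], double* dfdy, double dfdt[], void* params) {
    (void)t;  // the system has no explicit time dependence
    double mu = *static_cast<double*>(params);
    dfdy[0] = 0.0;                             // d f[0] / d y[0]
    dfdy[1] = 1.0;                             // d f[0] / d y[1]
    dfdy[2] = -2.0 * mu * y[0] * y[1] - 1.0;   // d f[1] / d y[0]
    dfdy[3] = -mu * (y[0] * y[0] - 1.0);       // d f[1] / d y[1]
    dfdt[0] = 0.0;                             // d f[0] / d t
    dfdt[1] = 0.0;                             // d f[1] / d t
    return 0;  // stands in for GSL_SUCCESS
}

// Compiles only if jac has exactly the expected type.
jac_fn jac_check = jac;
```

With GSL included, a function of this shape can take the place of the int * jac variable in {func, jac, 2, &mu}, and it would return GSL_SUCCESS instead of a literal 0.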

SSE (SIMD extensions) support in gcc

I see a code as below:
#include "stdio.h"
#define VECTOR_SIZE 4

typedef float v4sf __attribute__ ((vector_size(sizeof(float)*VECTOR_SIZE)));
// vector of four single floats

typedef union f4vector
{
    v4sf v;
    float f[VECTOR_SIZE];
} f4vector;

void print_vector (f4vector *v)
{
    printf("%f,%f,%f,%f\n", v->f[0], v->f[1], v->f[2], v->f[3]);
}

int main()
{
    union f4vector a, b, c;
    a.v = (v4sf){1.2, 2.3, 3.4, 4.5};
    b.v = (v4sf){5., 6., 7., 8.};
    c.v = a.v + b.v;
    print_vector(&a);
    print_vector(&b);
    print_vector(&c);
}
This code builds fine and works as expected using gcc and its built-in SSE / MMX extensions and vector data types. It performs a SIMD vector addition on four single-precision floats.
I want to understand in detail what each keyword and function call on this typedef line means and does:
typedef float v4sf __attribute__ ((vector_size(sizeof(float)*VECTOR_SIZE)));
What does the vector_size() function return?
What is the __attribute__ keyword for?
Is the float data type being typedef'd to the v4sf type here?
I understand the rest.
thanks,
-AD
__attribute__ is GCC's way of exposing functionality from the compiler that isn't in the C or C++ standards.
__attribute__((vector_size(x))) instructs GCC to treat the type as a vector of x bytes. For SSE this is 16 bytes.
However, I would suggest using the __m128, __m128i or __m128d types found in the various <*mmintrin.h> headers. They are more portable across compilers.
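As a small illustration of what the typedef line produces, the sketch below (assuming GCC or Clang) checks that the resulting type is 16 bytes wide and that arithmetic and indexing work per lane. The function names add and sum_of_lanes are hypothetical and exist only for this example.

```cpp
// vector_size takes a size in bytes: 4 floats * 4 bytes = 16 bytes.
typedef float v4sf __attribute__((vector_size(4 * sizeof(float))));

// The typedef names a new 16-byte vector type, not a plain float.
static_assert(sizeof(v4sf) == 16, "v4sf should hold four floats");

// Element-wise addition; on SSE targets the compiler can emit one addps.
v4sf add(v4sf a, v4sf b) {
    return a + b;
}

// Lanes can be read with the subscript operator.
float sum_of_lanes(v4sf v) {
    return v[0] + v[1] + v[2] + v[3];
}
```

So vector_size is not a function that returns a value at run time. It is an argument to the attribute, and the base type float together with the byte count determines how many lanes the new type has.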