CMake finds CUDA but fails to find CUDA libraries on Windows

I have a small CMake project that works perfectly well on Linux but fails on Windows 10 (I tried it on two different computers) with the latest version of CMake and CUDA 8. It finds CUDA just fine, but fails to find the libraries. My CMakeLists.txt:
cmake_minimum_required(VERSION 3.0)
project(myproject)
find_package(CUDA REQUIRED)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject ${CUDA_CUBLAS_LIBRARIES})
message(STATUS "")
message(STATUS "FoundCUDA : ${CUDA_FOUND}")
message(STATUS "Cuda cublas libraries : ${CUDA_CUBLAS_LIBRARIES}")
In the same folder, I have the header matrix_mm.cuh:
#include <cstdlib>
namespace myproject {
float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n);
} /* end namespace myproject */
and matrix_mm.cu:
#include <cublas_v2.h>
#include "matrix_mm.cuh"

namespace myproject {

// Adapted from https://solarianprogrammer.com/2012/05/31/matrix-multiplication-cuda-cublas-curand-thrust/
void gpu_blas_mmul(const float *a, const float *b, float *c, const size_t m, const size_t k, const size_t n) {
    int lda = m, ldb = k, ldc = m;
    const float alf = 1;
    const float bet = 0;
    const float *alpha = &alf;
    const float *beta = &bet;

    // Create a handle for CUBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Do the actual multiplication
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);

    // Destroy the handle
    cublasDestroy(handle);
}

float* cuda_mm(const float *a, const float *b, const size_t m, const size_t k, const size_t n) {
    size_t const a_bytes = m * k * sizeof(float);
    size_t const b_bytes = k * n * sizeof(float);
    size_t const c_bytes = m * n * sizeof(float);

    float* c = (float*)std::malloc(c_bytes);

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, a_bytes);
    cudaMalloc(&d_B, b_bytes);
    cudaMalloc(&d_C, c_bytes);

    cudaMemcpy(d_A, a, a_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, b, b_bytes, cudaMemcpyHostToDevice);

    gpu_blas_mmul(d_A, d_B, d_C, m, k, n);

    cudaMemcpy(c, d_C, c_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    return c;
}

} /* end namespace myproject */
On Linux I get:
-- FoundCUDA : TRUE
-- Toolkit root : /usr
-- Cuda cublas libraries : /usr/lib/x86_64-linux-gnu/libcublas.so
While on both Windows 10 machines I get:
-- FoundCUDA : TRUE
-- Toolkit root : C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v8.0
-- Cuda cublas libraries : CUDA_cublas_LIBRARY-NOTFOUND;CUDA_cublas_device_LIBRARY-NOTFOUND
...and of course it fails to compile because the linker can't find cublas.
I have tried quite a few things: making the library SHARED instead of STATIC, making sure CUDA is in Windows' environment variables, etc., but nothing works.

This is the CMake snippet I use to find CUDA 8 on Windows 10 with CMake 3.7.1:
cmake_minimum_required(VERSION 3.7)
project(myproject)
# Check for CUDA ENV vars
IF(NOT DEFINED ENV{CUDA_PATH})
MESSAGE(FATAL_ERROR "CUDA_PATH Environment variable is not set.")
ENDIF(NOT DEFINED ENV{CUDA_PATH})
# Set the toolkit path
FILE(TO_CMAKE_PATH "$ENV{CUDA_PATH}" CUDA_TOOLKIT_ROOT_DIR)
SET(CUDA_TOOLKIT_ROOT_DIR ${CUDA_TOOLKIT_ROOT_DIR} CACHE STRING "Root directory of the Cuda Library" FORCE)
# Find the package
find_package(CUDA REQUIRED)
# Create an interface library as a link target (requires CMake 3.7.0+)
add_library(cuda INTERFACE)
set_target_properties(cuda PROPERTIES
    INTERFACE_INCLUDE_DIRECTORIES "${CUDA_INCLUDE_DIRS}"
    INTERFACE_LINK_LIBRARIES "${CUDA_LIBRARIES};${CUDA_CUFFT_LIBRARIES};${CUDA_CUBLAS_LIBRARIES}"
)
SET(CUDA_HOST_COMPILATION_CPP ON)
cuda_add_library(myproject STATIC matrix_mm.cu)
target_link_libraries(myproject cuda)
I think the quotes around the libraries are important, since the Windows paths will contain spaces.
I would also make sure that you delete your cache and regenerate the project: a stale cache is often the cause of errors when variable values appear correct on the surface (or when you change a non-FORCE cache variable).
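For example, from the build directory (a sketch; substitute whichever generator you actually use):
del CMakeCache.txt
cmake -G "Visual Studio 14 2015 Win64" ..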

Related

How to calculate comp_ellint_1(0) on a C++11 compiler

I'm sorry if this is a really stupid question, but I really need this for my master's thesis, and I just can't find a way. I need to calculate the complete elliptic integral of the first kind with Eclipse 3.8 on an Ubuntu laptop. My compiler flags are -c -fmessage-length=0 -std=c++11.
As for the Ubuntu version, it's
laptop:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
and for the gcc compiler, it is
laptop:~$ gcc --version
gcc (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I found under mathematical special functions that there is a function double comp_ellint_1(float arg) that would do the job, but as I understand it, it is only included in C++17, which I don't have and can't find information on how to install. But apparently there is a way to calculate the function without C++17, because it says:
As all special functions, comp_ellint_1 is only guaranteed to be available in <cmath> if __STDCPP_MATH_SPEC_FUNCS__ is defined by the implementation to a value at least 201003L and if the user defines __STDCPP_WANT_MATH_SPEC_FUNCS__ before including any standard library headers.
But their example code
#define __STDCPP_WANT_MATH_SPEC_FUNCS__ 1
#include <cmath>
#include <iostream>
int main(){
double integral= std::comp_ellint_1(0);
return 0;
}
does not work, the error being 15:22: error: ‘comp_ellint_1’ is not a member of ‘std’. I've also tried
#define __STDCPP_MATH_SPEC_FUNCS__ 201003L
#define __STDCPP_WANT_MATH_SPEC_FUNCS__ 1
#include <cmath>
#include <iostream>
int main(){
double integral= std::comp_ellint_1(0);
return 0;
}
which leads to the same error. It does not say whether I need to install certain packages to make it work (if I do need any, which are they and how do I install them?). Or am I making a different mistake?
I'd be super thankful for any ideas how to solve this, so thank you very much in advance!
Your gcc 4.8.5 has this function as std::tr1::comp_ellint_1.
You will need to #include <tr1/cmath>.
This is mentioned on the cppreference page for its C++17 version.
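A minimal sketch of that route, compiled with the same -std=c++11 flags (comp_ellint_1(0) equals pi/2, which makes a handy sanity check):
#include <tr1/cmath>
#include <iostream>

int main() {
    double integral = std::tr1::comp_ellint_1(0.0);
    std::cout << integral << std::endl; // prints 1.5708, i.e. pi/2
    return 0;
}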
If that does not work, or you also want to run on older versions, you can use Boost. In Visual Studio you would include:
#define BOOST_CONFIG_SUPPRESS_OUTDATED_MESSAGE
#include <boost/lambda/lambda.hpp>
#include <boost/math/special_functions/ellint_1.hpp>
#include <boost/math/special_functions/ellint_2.hpp>
#include <boost/math/special_functions/ellint_3.hpp>
Then:
using namespace boost::math;
double Kk = ellint_1(k);
double Ek1 = ellint_2(k) / (q - 4.*al);
To do that, copy Boost onto your hard disk, for example to C:\boost_1_66_0. Then, in the project properties, add the following:
C/C++ Directories->Additional Include Directories: C:\boost_1_66_0
C/C++->Precompiled Headers->Precompiled Header: Not Using Precompiled Headers
Linker->General->Additional Library Directories: C:\boost_1_66_0\libs
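Put together, a self-contained sketch (Boost.Math's elliptic integrals are header-only, so nothing needs linking for this; only the include path matters):
#define BOOST_CONFIG_SUPPRESS_OUTDATED_MESSAGE
#include <boost/math/special_functions/ellint_1.hpp>
#include <boost/math/special_functions/ellint_2.hpp>
#include <iostream>

int main() {
    double k = 0.0; // modulus
    // The one-argument overloads give the complete integrals K(k) and E(k)
    double Kk = boost::math::ellint_1(k);
    double Ek = boost::math::ellint_2(k);
    std::cout << "K(0) = " << Kk << ", E(0) = " << Ek << std::endl; // both pi/2
    return 0;
}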
Another way is to include the following function, which calculates both complete integrals, first and second kind. I tested it against an online tool and against ellint_1 and ellint_2, and it worked well:
#include <math.h>  // fabs, sqrt
#include <float.h> // DBL_MAX, DBL_EPSILON

void Complete_Elliptic_Integrals(double x, double* Fk, double* Ek)
{
    const double PI_2 = 1.5707963267948966192313216916397514; // pi/2
    const double PI_4 = 0.7853981633974483096156608458198757; // pi/4
    double k;      // modulus
    double m;      // the parameter of the elliptic function, m = modulus^2
    double a;      // arithmetic mean
    double g;      // geometric mean
    double a_old;  // previous arithmetic mean
    double g_old;  // previous geometric mean
    double two_n;  // power of 2
    double sum;

    if ( x == 0.0 ) { *Fk = PI_2; *Ek = PI_2; return; }
    k = fabs(x);
    m = k * k;
    if ( m == 1.0 ) { *Fk = DBL_MAX; *Ek = 1.0; return; }

    a = 1.0;
    g = sqrt(1.0 - m);
    two_n = 1.0;
    sum = 2.0 - m;
    for (int i = 0; i < 100; i++)
    {
        g_old = g;
        a_old = a;
        a = 0.5 * (g_old + a_old);
        g = g_old * a_old; // g holds the squared geometric mean until the sqrt below
        two_n += two_n;
        sum -= two_n * (a * a - g);
        if ( fabs(a_old - g_old) <= (a_old * DBL_EPSILON) ) break;
        g = sqrt(g);
    }

    *Fk = PI_2 / a;
    *Ek = (PI_4 / a) * sum;
    return;
}
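Typical usage (the value 0.5 is just an illustration):
double Fk, Ek;
Complete_Elliptic_Integrals(0.5, &Fk, &Ek); // Fk ≈ K(0.5) = 1.68575, Ek ≈ E(0.5) = 1.46746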
Unfortunately, it takes about twice as long to run as ellint_1 and ellint_2.

How do I include sm_11_atomic_function.h? [duplicate]

I'm having an issue with my kernel.cu file.
When calling nvcc -v kernel.cu -o kernel.o I get this error:
kernel.cu(17): error: identifier "atomicAdd" is undefined
My code:
#include "dot.h"
#include <cuda.h>
#include "device_functions.h" //might call atomicAdd
__global__ void dot (int *a, int *b, int *c){
__shared__ int temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads();
if( 0 == threadIdx.x ){
int sum = 0;
for( int i = 0; i<THREADS_PER_BLOCK; i++)
sum += temp[i];
atomicAdd(c, sum);
}
}
Any suggestions?
You need to specify an architecture to nvcc which supports atomic memory operations (the default architecture is 1.0 which does not support atomics). Try:
nvcc -arch=sm_11 -v kernel.cu -o kernel.o
and see what happens.
EDIT in 2015 to note that the default architecture in CUDA 7.0 is now 2.0, which supports atomic memory operations, so this should not be a problem in newer toolkit versions.
Today, with the latest CUDA SDK and toolkit, this solution will not work.
People also say that adding:
compute_11,sm_11; OR compute_12,sm_12; OR compute_13,sm_13;
compute_20,sm_20;
compute_30,sm_30;
to CUDA in the Project Properties in Visual Studio 2010 will work. It doesn't.
You have to specify this for the .cu file itself, in its own properties (under the CUDA C/C++->Device->Code Generation tab), such as:
compute_13,sm_13;
compute_20,sm_20;
compute_30,sm_30;
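For reference, the command-line equivalent of those code-generation settings uses nvcc's -gencode flags, along these lines (a sketch for the older architectures discussed here):
nvcc -gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 kernel.cu -o kernel.o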

Compiling GSL odeiv2 with g++

I'm attempting to compile the example code for the ODE solver, gsl/gsl_odeiv2, using g++. The code below is from the GSL website:
http://www.gnu.org/software/gsl/manual/html_node/ODE-Example-programs.html
and compiles fine under gcc, but g++ throws the error
invalid conversion from 'void*' to 'int (*)(double, const double*, double*, double*,
void*)' [-fpermissive]
in the code :
#include <stdio.h>
#include <gsl/gsl_errno.h>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_odeiv2.h>

int func (double t, const double y[], double f[], void *params)
{
    double mu = *(double *)params;
    f[0] = y[1];
    f[1] = -y[0] - mu*y[1]*(y[0]*y[0] - 1);
    return GSL_SUCCESS;
}

int * jac;

int main ()
{
    double mu = 10;
    gsl_odeiv2_system sys = {func, jac, 2, &mu};
    gsl_odeiv2_driver * d = gsl_odeiv2_driver_alloc_y_new (&sys, gsl_odeiv2_step_rkf45, 1e-6, 1e-6, 0.0);
    int i;
    double t = 0.0, t1 = 100.0;
    double y[2] = { 1.0, 0.0 };

    for (i = 1; i <= 100; i++)
    {
        double ti = i * t1 / 100.0;
        int status = gsl_odeiv2_driver_apply (d, &t, ti, y);
        if (status != GSL_SUCCESS)
        {
            printf ("error, return value=%d\n", status);
            break;
        }
        printf ("%.5e %.5e %.5e\n", t, y[0], y[1]);
    }

    gsl_odeiv2_driver_free (d);
    return 0;
}
The error is given on the line
gsl_odeiv2_system sys = {func, jac, 2, &mu};
Any help in solving this issue would be fantastic. I'm hoping to include some stdlib elements, hence wanting to compile it as C++. Also, if I can get it to compile with g++-4.7, I could more easily multithread it using C++11's additions to the language. Thank you very much.
It looks like you have a problem with the Jacobian. In your particular case you could just use NULL instead of jac in the definition of your system, i.e.
gsl_odeiv2_system sys = {func, NULL, 2, &mu};
In general, your Jacobian must be a function with a particular signature (see the GSL manual); that is why your compiler is complaining.
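For reference, a Jacobian with the signature GSL expects looks like this; the entries below are the Van der Pol Jacobian from the same manual example:
int jac (double t, const double y[], double *dfdy, double dfdt[], void *params)
{
    double mu = *(double *)params;
    // dfdy is the Jacobian matrix J[i][j] = df_i/dy_j, stored row-major
    gsl_matrix_view dfdy_mat = gsl_matrix_view_array (dfdy, 2, 2);
    gsl_matrix *m = &dfdy_mat.matrix;
    gsl_matrix_set (m, 0, 0, 0.0);
    gsl_matrix_set (m, 0, 1, 1.0);
    gsl_matrix_set (m, 1, 0, -2.0*mu*y[0]*y[1] - 1.0);
    gsl_matrix_set (m, 1, 1, -mu*(y[0]*y[0] - 1.0));
    // dfdt holds the explicit time derivatives; zero for this autonomous system
    dfdt[0] = 0.0;
    dfdt[1] = 0.0;
    return GSL_SUCCESS;
}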
Also, you may want to link the GSL library manually:
-L/usr/local/lib -lgsl
if you are on a Linux system.
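A full compile line would then look something like this (the file name is just a placeholder; -lgslcblas and -lm are usually needed alongside -lgsl):
g++ -std=c++11 ode_example.cpp -L/usr/local/lib -lgsl -lgslcblas -lm -o ode_example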

SSE2: _mm_mul_ps fails on OS X in case of GCC 4.2 and O0 optimization

I am trying to calculate the squared Euclidean distance between two 4D float vectors using SSE2. My OS is Mac OS X 10.7 Lion.
When I use the Apple LLVM compiler in Xcode 4.5.2, everything is fine. But when I switch to GCC 4.2 in the project settings, I get an EXC_BAD_ACCESS error at the _mm_mul_ps operation.
When I compile the code from the command line (g++ main.cpp) without additional arguments, I get "Segmentation fault". But when I enable any optimization level (O1, O2, O3, Os) other than O0, everything works.
I cannot reproduce this issue on my Ubuntu 12.04 with GCC 4.6.3.
#include <stdio.h>
#include <emmintrin.h>

typedef float SPPixel[4];

float sp_squared_color_diff(const SPPixel px1, const SPPixel px2) {
    SPPixel d;
    __m128 sse_px1 = _mm_load_ps(px1);
    __m128 sse_px2 = _mm_load_ps(px2);
    sse_px1 = _mm_sub_ps(sse_px1, sse_px2);
    sse_px2 = _mm_mul_ps(sse_px1, sse_px1); // EXC_BAD_ACCESS
    _mm_store_ps(d, sse_px2);
    return d[0] + d[1] + d[2] + d[3];
}

int main(int argc, const char * argv[]) {
    SPPixel a __attribute__ ((aligned (16))) = {1, 2, 3, 4};
    SPPixel b __attribute__ ((aligned (16))) = {2, 4, 6, 8};
    float result = sp_squared_color_diff(a, b);
    printf("result = %f\n", result);
    return 0;
}
The local variable d is misaligned. Fix the alignment in the typedef for SPPixel rather than having to remember it on every definition.
Change:
typedef float SPPixel[4];
to:
typedef float SPPixel[4] __attribute__ ((aligned(16)));
and then you can also remove the __attribute__ ((aligned(16))) qualifiers in main.
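Putting it together, a sketch of the fixed program (same logic, only the alignment moved into the typedef):
#include <stdio.h>
#include <emmintrin.h>

typedef float SPPixel[4] __attribute__ ((aligned(16)));

float sp_squared_color_diff(const SPPixel px1, const SPPixel px2) {
    SPPixel d; // now 16-byte aligned automatically, so _mm_store_ps is safe
    __m128 sse_px1 = _mm_load_ps(px1);
    __m128 sse_px2 = _mm_load_ps(px2);
    sse_px1 = _mm_sub_ps(sse_px1, sse_px2);
    sse_px2 = _mm_mul_ps(sse_px1, sse_px1);
    _mm_store_ps(d, sse_px2);
    return d[0] + d[1] + d[2] + d[3];
}

int main(int argc, const char * argv[]) {
    SPPixel a = {1, 2, 3, 4};
    SPPixel b = {2, 4, 6, 8};
    printf("result = %f\n", sp_squared_color_diff(a, b)); // 1 + 4 + 9 + 16 = 30
    return 0;
}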

glext visual studio cuda

I am currently taking a parallel computing class that uses the book CUDA by Example. In Chapter 4 of this book I am using some .h files that contain includes for "GL/glut.h" and "GL/glext.h". I found steps for installing GLUT online and followed those; I think that worked, but I am not sure. I then tried to find directions for glext, but I cannot find as much on this. I did find one .h file and tried to use it by placing it in the GL folder as well. This does not seem to work, because when compiling I receive errors similar to this:
Error 1 error : calling a __host__ function("cuComplex::cuComplex") from a __device__/__global__ function("julia") is not allowed C:\Users\Laptop\Documents\Visual Studio 2010\Projects\Lab1\Lab1\lab1.cu 29 1 Lab1
I think this is because I need more for glext.h, like .dll files and things similar to GLUT, but I am not sure. Any help with this would be appreciated. Thank you.
EDIT: This is the code I am using. I have not changed it from what I see in the book, except for the top two include statements; the .h files are from Google Code. Thank you for any help:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "book.h"
#include "cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
float r;
float i;
cuComplex( float a, float b) : r(a), i(b) {}
__device__ float magnitude2(void) {
return r*r + i*i;
}
__device__ cuComplex operator* (const cuComplex& a) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+ (const cuComplex& a) {
return cuComplex(r+a.r, i+a.i);
}
};
__device__ int julia( int x, int y) {
const float scale = 1.5;
float jx = scale * (float)(DIM/2 -x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8, .156);
cuComplex a(jx, jy);
int i = 0;
for(i=0;i<200;i++) {
a = a * a + c;
if(a.magnitude2() > 1000)
return 0;
}
return 1;
}
__global__ void kernel(unsigned char *ptr ) {
//map from threadIdx/BlockIdx to pixel position
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y * gridDim.x;
//now claculate the value at that position
int juliaValue = julia(x,y);
ptr[offset*4 + 0] = 255 * juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 0;
ptr[offset*4 + 3] = 255;
}
int main( void ) {
CPUBitmap bitmap(DIM, DIM);
unsigned char *dev_bitmap;
HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));
dim3 grid(DIM,DIM);
kernel<<<grid,1>>>( dev_bitmap );
HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));
bitmap.display_and_exit();
HANDLE_ERROR( cudaFree( dev_bitmap ));
}
Try adding the following.
Original code:
cuComplex( float a, float b) : r(a), i(b) {}
Modified:
__host__ __device__ cuComplex( float a, float b ) : r(a), i(b) {}
It fixed the issue for me. I also didn't need the two include files you added, but you may depending on your build process.
A CUDA program consists of two types of code: host code and device code. Host code runs on the host CPU and cannot run on the GPU; device code runs on the GPU and cannot run on the CPU. If you don't decorate your program in any way, then it will be all host code. But once you start adding CUDA sections delineated by keywords like __global__ or __device__, your program will contain some device code.
The compiler error you received indicated that a function running on the device was attempting to use code compiled for the CPU. This is a no-no, and the compiler will not allow it. This example is unusual since at some point in time (when the book was written) it presumably did not generate this error, and furthermore the code in the cuComplex struct appears to be decorated with the __device__ keyword. However, the constructor on the line of code I modified has no __device__ keyword. When I add the __host__ __device__ keywords, this tells the compiler "for this logical section, create both a device-compiled version and a host-compiled version of the code". This explicitly tells the compiler you want to be able to use this section of code on the device. With that addition, we have steered the compiler correctly and it no longer complains.
Apparently something has changed about the level of decoration that the compiler needs to generate device code in this case. Presumably, with older compilers, the __device__ keywords inside the struct were enough to let the compiler know that it had to generate device versions of the operators callable on cuComplex objects.
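As a general illustration (a hypothetical helper, not from the book), a function that must be callable from both sides gets both qualifiers:
// Usable from main() on the CPU and from kernels on the GPU alike
__host__ __device__ float magnitude_squared(float r, float i) {
    return r * r + i * i;
}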
