I am currently taking a parallel computing class that uses the book CUDA by Example. In Chapter 4, I am using some .h files that include "GL/glut.h" and "GL/glext.h". I found instructions online for installing GLUT and followed them; I think that worked, but I am not sure. I then tried to find instructions for glext, but there seems to be much less available. I did find one .h file and tried to use it by placing it in the GL folder as well. This does not seem to work, because I get compile errors like this:
Error 1 error : calling a __host__ function("cuComplex::cuComplex") from a __device__/__global__ function("julia") is not allowed C:\Users\Laptop\Documents\Visual Studio 2010\Projects\Lab1\Lab1\lab1.cu 29 1 Lab1
I think this is because I need more files for glext.h, such as a .dll, similar to GLUT, but I am not sure. Any help with this would be appreciated. Thank you.
EDIT: This is the code I am using. I have not changed it from what appears in the book, except for the top two include statements; the .h files are from Google Code. Thank you for any help.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "book.h"
#include "cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
float r;
float i;
cuComplex( float a, float b) : r(a), i(b) {}
__device__ float magnitude2(void) {
return r*r + i*i;
}
__device__ cuComplex operator* (const cuComplex& a) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+ (const cuComplex& a) {
return cuComplex(r+a.r, i+a.i);
}
};
__device__ int julia( int x, int y) {
const float scale = 1.5;
float jx = scale * (float)(DIM/2 -x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8, .156);
cuComplex a(jx, jy);
int i = 0;
for(i=0;i<200;i++) {
a = a * a + c;
if(a.magnitude2() > 1000)
return 0;
}
return 1;
}
__global__ void kernel(unsigned char *ptr ) {
//map from threadIdx/BlockIdx to pixel position
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y * gridDim.x;
//now claculate the value at that position
int juliaValue = julia(x,y);
ptr[offset*4 + 0] = 255 * juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 0;
ptr[offset*4 + 3] = 255;
}
int main( void ) {
CPUBitmap bitmap(DIM, DIM);
unsigned char *dev_bitmap;
HANDLE_ERROR(cudaMalloc((void**)&dev_bitmap, bitmap.image_size()));
dim3 grid(DIM,DIM);
kernel<<<grid,1>>>( dev_bitmap );
HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap, bitmap.image_size(), cudaMemcpyDeviceToHost));
bitmap.display_and_exit();
HANDLE_ERROR( cudaFree( dev_bitmap ));
}
Try adding the following.
Original code:
cuComplex( float a, float b) : r(a), i(b) {}
Modified:
__host__ __device__ cuComplex( float a, float b ) : r(a), i(b) {}
That fixed the issue for me. I also didn't need the two include files you added, but you may, depending on your build process.
A CUDA program consists of two types of code: host code and device code. Host code runs on the host CPU and cannot run on the GPU; device code runs on the GPU and cannot run on the CPU. If you don't decorate your program in any way, it will be all host code. But once you start adding CUDA sections delineated by keywords like __global__ or __device__, your program will contain some device code.
The compiler error you received indicated that a function running on the device was attempting to use code compiled for the CPU. This is a no-no and the compiler will not allow it. This example is unusual, since at some point in time (when the book was written) it presumably did not generate this error, and furthermore the code in the cuComplex struct appears to be decorated with the __device__ keyword. However, the constructor, on the line of code I modified, carries no __device__ decoration. When I add the __host__ __device__ keywords, this tells the compiler "for this logical section, create both a device-compiled version and a host-compiled version of the code". This explicitly tells the compiler you want to be able to use this section of code on the device. With that addition, we have steered the compiler correctly and it no longer complains.
Apparently something has changed about the level of decoration the compiler needs in order to generate device code in this case. Presumably, with older compilers, the __device__ keywords inside the struct were enough to let the compiler know it had to generate device versions of the operators callable on the cuComplex type.
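Putting it together, a sketch of the struct with the fix applied; the __device__ half of the qualifier is what the julia() call path needs, and __host__ keeps the constructor usable from CPU code too:

struct cuComplex {
    float r;
    float i;
    // compiled for both host and device, so julia() can construct one on the GPU
    __host__ __device__ cuComplex( float a, float b ) : r(a), i(b) {}
    __device__ float magnitude2( void ) { return r*r + i*i; }
    __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};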
I'm sorry if this is a really stupid question, but I really need this for my master's thesis, and I just can't find a way. I need to calculate the complete elliptic integral of the first kind using Eclipse 3.8 on an Ubuntu laptop. My compiler flags are -c -fmessage-length=0 -std=c++11.
As for the Ubuntu version:
laptop:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
and for the gcc compiler, it is
laptop:~$ gcc --version
gcc (Ubuntu 4.8.5-2ubuntu1~14.04.1) 4.8.5
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I found under mathematical special functions that there is a function double comp_ellint_1( float arg ) that would do the job, but as I understand it, it is only included in C++17, which I don't have and can't find information about how to install. But apparently there is a possibility to calculate the function without C++17, because it says:
As all special functions, comp_ellint_1 is only guaranteed to be available in <cmath> if __STDCPP_MATH_SPEC_FUNCS__ is defined by the implementation to a value at least 201003L and if the user defines __STDCPP_WANT_MATH_SPEC_FUNCS__ before including any standard library headers.
But their example code
#define __STDCPP_WANT_MATH_SPEC_FUNCS__ 1
#include <cmath>
#include <iostream>
int main(){
    double integral = std::comp_ellint_1(0);
    return 0;
}
does not work; the error is 15:22: error: ‘comp_ellint_1’ is not a member of ‘std’. I've also tried
#define __STDCPP_MATH_SPEC_FUNCS__ 201003L
#define __STDCPP_WANT_MATH_SPEC_FUNCS__ 1
#include <cmath>
#include <iostream>
int main(){
    double integral = std::comp_ellint_1(0);
    return 0;
}
which leads to the same error. It does not say whether I need to install certain packages to make this work (if I do need any, which are they, and how do I install them?). Or am I making a different mistake?
I'd be super thankful for any ideas how to solve this, so thank you very much in advance!
Your gcc 4.8.5 has this function as std::tr1::comp_ellint_1.
You will need to #include <tr1/cmath>.
This is mentioned in the cppreference page for its C++17 version.
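A minimal sketch of that approach (the TR1 header should not need the __STDCPP_WANT_MATH_SPEC_FUNCS__ macro):

#include <tr1/cmath>
#include <iostream>

int main() {
    // complete elliptic integral of the first kind: K(0) = pi/2
    double integral = std::tr1::comp_ellint_1(0.0);
    std::cout << integral << std::endl;   // prints 1.5708
    return 0;
}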
If that does not work, or you also want your code to run on older versions, you can use Boost. In Visual Studio you would include:
#define BOOST_CONFIG_SUPPRESS_OUTDATED_MESSAGE
#include <boost/lambda/lambda.hpp>
#include <boost/math/special_functions/ellint_1.hpp>
#include <boost/math/special_functions/ellint_2.hpp>
#include <boost/math/special_functions/ellint_3.hpp>
Then:
using namespace boost::math;
double Kk = ellint_1(k);
double Ek1 = ellint_2(k) / (q - 4.*al);
To do that, put a copy of Boost on your hard disk, for example at C:\boost_1_66_0. Then edit the project properties and add the following settings:
C/C++ -> Additional Include Directories: C:\boost_1_66_0
C/C++ -> Precompiled Headers -> Precompiled Header: Not Using Precompiled Headers
Linker -> General -> Additional Library Directories: C:\boost_1_66_0\libs;
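For a quick self-contained check, a minimal sketch (Boost.Math's elliptic integrals are header-only, so for this usage nothing actually needs to be linked):

#include <boost/math/special_functions/ellint_1.hpp>
#include <iostream>

int main() {
    // one-argument ellint_1 is the complete integral K(k); K(0) = pi/2
    double Kk = boost::math::ellint_1(0.0);
    std::cout << Kk << std::endl;   // prints 1.5708
    return 0;
}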
Another option is to include the following function, which calculates both the first- and second-kind complete integrals. I tested it against ellint_1 and ellint_2 using an online tool, and it worked well:
#include <math.h>    /* sqrt, fabs */
#include <float.h>   /* DBL_MAX, DBL_EPSILON */

void Complete_Elliptic_Integrals(double x, double* Fk, double* Ek)
{
    const double PI_2 = 1.5707963267948966192313216916397514; // pi/2
    const double PI_4 = 0.7853981633974483096156608458198757; // pi/4
    double k;      // modulus
    double m;      // the parameter of the elliptic function, m = modulus^2
    double a;      // arithmetic mean
    double g;      // geometric mean
    double a_old;  // previous arithmetic mean
    double g_old;  // previous geometric mean
    double two_n;  // power of 2
    double sum;

    if ( x == 0.0 ) { *Fk = PI_2; *Ek = PI_2; return; }
    k = fabs(x);
    m = k * k;
    if ( m == 1.0 ) { *Fk = DBL_MAX; *Ek = 1.0; return; }

    a = 1.0;
    g = sqrt(1.0 - m);
    two_n = 1.0;
    sum = 2.0 - m;
    for (int i = 0; i < 100; i++)
    {
        g_old = g;
        a_old = a;
        a = 0.5 * (g_old + a_old);
        g = g_old * a_old;
        two_n += two_n;
        sum -= two_n * (a * a - g);
        if ( fabs(a_old - g_old) <= (a_old * DBL_EPSILON) ) break;
        g = sqrt(g);
    }
    *Fk = PI_2 / a;
    *Ek = (PI_4 / a) * sum;
    return;
}
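A quick usage sketch (reference values for modulus 0.5: K ≈ 1.6858, E ≈ 1.4675):

double K, E;
Complete_Elliptic_Integrals(0.5, &K, &E);
printf("K = %f, E = %f\n", K, E); /* needs <stdio.h> */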
Unfortunately it takes about twice as long to run as calling ellint_1 and ellint_2.
I haven't looked in ages at how a computer actually starts up, so I started playing around with writing my own loader, which boots into IA-32e mode and initializes all the CPUs with some dummy code to run. I'm fairly far along, but I'm getting tired of writing trivial things in assembler.
Here's a toy case of what I would like to achieve. Say I want to write a simple piece of code that prints a C-style string and keeps track of the cursor in some fixed location in memory. A C implementation would be something along the following lines (this code is untested; I wrote it on the fly, so don't comment on bugs, they're not relevant):
#define VIDEORAM_ADDRESS 0xa0000
#define VIDEORAM_LINE_LENGTH 160
#define VGA_GREY_ON_BLACK 0x07
#define CURSOR_X 0x100 /* dummy address */
#define CURSOR_Y 0x101

void printk(const char *s)
{
    volatile char *p;
    int x, y;

    x = *(volatile char*)CURSOR_X;
    y = *(volatile char*)CURSOR_Y;

    while(*s != 0) {
        if(*s == '\n') {
            y++;
            y = y >= 25 ? 0 : y;
            x = 0;
        } else {
            x++;
            if(x >= 80) {
                y++;
                y = y >= 25 ? 0 : y;
                x = 0;
            }
            p = (volatile char*)VIDEORAM_ADDRESS + x + y * VIDEORAM_LINE_LENGTH;
            *p++ = *s++;
            *p = VGA_GREY_ON_BLACK;
        }
    }

    *(volatile char*)CURSOR_X = x;
    *(volatile char*)CURSOR_Y = y;
}
I can compile this with gcc -m32 -O2 -S printk.c, which generates printk.s. My question is essentially how to combine this with a handwritten assembly file. The end result should of course be nothing but a single binary blob of machine code and data, loaded by the BIOS at 0000:7C00 if, say, I want to include the code in the stage 1 loader loaded from disk and call it after switching over to protected mode.
Is an alternative to put an .include directive somewhere in the handwritten assembly file to pull the code in? Unfortunately, gcc emits all kinds of directives for the GNU Assembler in the .s file, and I really only want the code for the printk function.
Is there some canonical way of doing this?
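One conventional approach, sketched here without having been tested against this exact setup, is to compile both pieces to object files and let the linker produce the flat binary, rather than splicing .s files together (boot.s is a placeholder name for the handwritten entry code, listed first so it lands at the start of the image):

gcc -m32 -ffreestanding -fno-pic -O2 -c printk.c -o printk.o
as --32 boot.s -o boot.o
# link at the BIOS load address and emit a raw binary instead of an ELF file
ld -m elf_i386 -Ttext 0x7c00 --oformat binary -o boot.bin boot.o printk.o

This avoids the .include route entirely, so the assembler directives gcc emits never need to be pruned by hand.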
I've implemented various algorithms using CUDA, such as matrix multiplication, Cholesky decomposition, and inversion (by forward substitution) of a lower triangular matrix.
For some of these algorithms I have a for loop in the kernel that repeats part of the kernel code many times. It all works well for (flattened, i.e. represented by 1D arrays) matrices of floats up to about 200x200, with the for loop calling part of the kernel code 200 times. Increasing the matrix size to, say, 1000x1000 (with the for loop calling part of the kernel code 1000 times) leaves the GPU taking as much computing time as would be expected from trials with smaller matrix sizes, but no kernel code (including the parts outside the for loop) appears to have run: the output matrix has none of its elements changed since initialization. If I increase the matrix size to around 500, I'm sometimes able to get the kernel to run if I set the limiter in the for loop to some low value (such as 3).
Have I hit some hardware limit here, or is there a trick I can use to make these for loops work for large matrices?
This is an example of complete code that you can copy into a .cu file. The kernel attempts to copy the contents of matrix A (W*H) to matrix B (W*H). The output shows the first element of both matrices; for sizes up to about 200x200 this works just fine, but for W*H = 1000x1000 no copying seems to occur, because the elements of B remain zero, as if nothing has happened since initialization. I'm compiling and running this code on a Linux-based server. For large matrices, error checking gives me "GPUassert: unspecified launch failure" at line 67, which is the cudaMemcpy line that copies matrix B from device to host.
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>
#include <time.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void MatrixCopy(float *A, float *B, int W)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    B[j*W + i] = A[j*W + i];
}

int main(void)
{
    clock_t start1 = clock();

    int W = 1000;
    int H = 1000;

    float *A, *B;
    float *devA, *devB;

    A = (float*)malloc(W*H*sizeof(float));
    B = (float*)malloc(W*H*sizeof(float));

    for(int i = 0; i < W*H; i++)
    {
        A[i] = rand() % 3;
        A[i] = A[i] + 1;
        B[i] = 0;
    }

    gpuErrchk( cudaMalloc( (void**)&devA, W*H*sizeof(float) ) );
    gpuErrchk( cudaMalloc( (void**)&devB, W*H*sizeof(float) ) );

    gpuErrchk( cudaMemcpy( devA, A, W*H*sizeof(float), cudaMemcpyHostToDevice ) );
    gpuErrchk( cudaMemcpy( devB, B, W*H*sizeof(float), cudaMemcpyHostToDevice ) );

    dim3 threads(32,32);
    int bloW = (int)ceil((double)W/32);
    int bloH = (int)ceil((double)H/32);
    dim3 blocks(bloW, bloH);

    clock_t finish1 = clock();
    clock_t start2 = clock();

    MatrixCopy<<<blocks,threads>>>(devA, devB, W);
    gpuErrchk( cudaPeekAtLastError() );

    gpuErrchk( cudaMemcpy( B, devB, W*H*sizeof(float), cudaMemcpyDeviceToHost ) );

    clock_t finish2 = clock();

    printf("\nGPU calculation time (ms): %d\nInitialization time (ms): %d\n\n",
           (int)ceil(double(((finish2-start2)*1000/(CLOCKS_PER_SEC)))),
           (int)ceil(double(((finish1-start1)*1000/(CLOCKS_PER_SEC)))));
    printf("\n%f\n", A[0]);
    printf("\n%f\n\n", B[0]);

    gpuErrchk( cudaFree(devA) );
    gpuErrchk( cudaFree(devB) );

    free(A);
    free(B);

#ifdef _WIN32
    system("PAUSE");
#endif

    return 0;
}
Your kernel has no thread checking.
You are deciding the grid size (in blocks) like this:
int bloW=(int)ceil((double)W/32);
int bloH=(int)ceil((double)H/32);
For values of H and W that are not whole-number multiples of the threads-per-block sizes (32), this creates extra threads and blocks outside the actual matrix you care about (1000x1000). There's nothing wrong with this; it is common practice.
However, we must make sure those extra threads don't actually do anything (i.e., don't generate invalid accesses to memory). Your kernel does not provide this check.
If you modify your kernel to be something like this:
__global__ void MatrixCopy(float *A, float *B, int W, int H)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;

    if ((i < W) && (j < H))
        B[j*W + i] = A[j*W + i];
}
I think you'll have better results. Without this check, some of your A and B references in the kernel generate out-of-bounds accesses, which you can see if you run your code with cuda-memcheck. You'll also have to modify the kernel invocation line to add the H parameter. I haven't really sorted out whether your i variable corresponds to H or W; I assume you can do that and make the change if needed. In this case, since the matrix is square, it doesn't really matter.
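For completeness, the matching invocation (only the extra argument changes) would be something like:

MatrixCopy<<<blocks,threads>>>(devA, devB, W, H);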
And you should do proper CUDA error checking any time you are having trouble with CUDA code. I would suggest doing that before posting here to ask for help.
I'm developing an accelerated component in OpenCL, using Xcode 4.5.1 and Grand Central Dispatch, guided by this tutorial.
The full kernel kept failing on the GPU, giving signal SIGABRT. I couldn't make much progress interpreting the error beyond that.
But I broke out aspects of the kernel to test, and I found something very peculiar involving assigning certain values to positions in an array within a loop.
Test scenario: give each thread a fixed range of array indices to initialize.
kernel void zero(size_t num_buckets, size_t positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);
    if (bucket_index >= num_buckets) return;
    for (size_t i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}
The above kernel fails. However, when I assign 1 instead of 0, the kernel succeeds (and my host code prints out the array of 1's). Based on a handful of tests on various integer values, I've only had problems with 0 and -1.
I've tried to outsmart the compiler with 1-1, (int) 0, etc., with no success. Passing zero in as a kernel argument worked, though.
The assignment to zero does work outside of the context of a for loop:
array[bucket_index * positions_per_bucket] = 0;
The findings above were confirmed on two machines with different configurations. (OSX 10.7 + GeForce, OSX 10.8 + Radeon.) Furthermore, the kernel had no trouble when running on CL_DEVICE_TYPE_CPU -- it's just on the GPU.
Clearly, something ridiculous is happening, and it must be on my end, because "zero" can't be broken. Hopefully it's something simple. Thank you for your help.
Host code:
#include <stdio.h>
#include <OpenCL/OpenCL.h>
#include "zero.cl.h"
int main(int argc, const char* argv[]) {
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);

    size_t num_buckets = 64;
    size_t positions_per_bucket = 4;

    cl_int* h_array = malloc(sizeof(cl_int) * num_buckets * positions_per_bucket);
    cl_int* d_array = gcl_malloc(sizeof(cl_int) * num_buckets * positions_per_bucket,
                                 NULL, CL_MEM_WRITE_ONLY);

    dispatch_sync(queue, ^{
        cl_ndrange range = { 1, { 0 }, { num_buckets }, { 0 } };
        zero_kernel(&range, num_buckets, positions_per_bucket, d_array);
        gcl_memcpy(h_array, d_array, sizeof(cl_int) * num_buckets * positions_per_bucket);
    });

    for (size_t i = 0; i < num_buckets * positions_per_bucket; i++)
        printf("%d ", h_array[i]);
    printf("\n");
}
Refer to the OpenCL standard, section 6, paragraph 8 "Restrictions", bullet point k (emphasis mine):
6.8 k. Arguments to kernel functions in a program cannot be declared with the built-in scalar types bool, half, size_t, ptrdiff_t, intptr_t, and uintptr_t. [...]
The fact that your compiler even let you build the kernel at all indicates it is somewhat broken.
So you might want to fix that. But if that doesn't fix it, then it looks like a compiler bug, plain and simple (in CLC, that is, the OpenCL compiler, not your host code): there is no reason a kernel like this should work with every constant except 0 and -1. Did you try updating your OpenCL driver? What about trying a different operating system (though I suppose this code is OS X only)?
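For illustration, a sketch of the kernel with the illegal size_t parameters swapped for uint (the choice of uint is my assumption; the host side would then pass cl_uint values):

kernel void zero(uint num_buckets, uint positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);
    if (bucket_index >= num_buckets) return;
    for (uint i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}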
How do I initialize a device array that was allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize the values to anything except 0. The code for cudaMemset looks like below, where value is initialized to 5.
cudaMemset(devPtr,value,number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset ( void* devPtr, int value, size_t count )
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking for is that each byte of devPtr be set to 5. If devPtr were an array of integers, the result would be that each integer word has the value 84215045 (that is, 0x05050505). This is probably not what you had in mind.
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for 16-bit and 32-bit word types. If you need to set 64-bit or larger types (so doubles or vector types), your best option is to use your own kernel.
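As an illustrative sketch, launching the templated kernel for a float array might look like this (the grid and block sizes are placeholders; the grid-stride loop means any reasonable launch configuration covers all nwords entries):

const size_t nwords = 1 << 20;
float *devPtr;
cudaMalloc((void **)&devPtr, nwords * sizeof(float));
initKernel<float><<<256, 256>>>(devPtr, 1.0f, nwords);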
I also needed a solution to this question, and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid with for(; tidx < nwords; tidx += stride), how the kernel invocation should look, and why it uses the counter-intuitive word sizes.
Therefore I created a much simpler, monolithic generic kernel and customized it with a stride, i.e. you may use it to initialize a matrix in multiple ways, e.g. setting rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid*incx < n) {
        a[tid*incx] = value;
    }
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
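A hypothetical usage, assuming BLOCK_SIZE is defined somewhere (e.g. #define BLOCK_SIZE 256) and d_mat is a rows x cols row-major device matrix:

// touch every element: stride of 1
deviceInitializeArray<double>(d_mat, 0.0, rows * cols, 1);
// set only the first column: one element per row
deviceInitializeArray<double>(d_mat, 1.0, rows * cols, cols);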