For some reason, NVCC is crashing when trying to compile a GPU program with very long double-precision arithmetic expressions, of the form
// given double precision arrays A[ ], F[ ],
__global__ myKernel(double *A, double *F, long n){
//.......thread ids
A[t+1] = A[t]*F[t+1] + ...... (order of million of terms).... + A[t-1]*F[t]
}
The same code does get compiled successfully with GCC (around 30 minutes compiling) and even executes correctly.
The crash occurs after +20hrs of compilation time, with the following error:
nvcc error : 'cicc' died due to signal 11 (Invalid memory reference)
nvcc error : 'cicc' core dumped
make: *** [Makefile:28: obj/EquationAlfa.cu.o] Error 139
As a side note, if we change the GPU program to 32-bit float, then it does compile correctly, although still taking hours to compile.
The compilation line is this:
nvcc -std=c++14 -c -o obj/EquationAlfa.cu.o src/EquationAlfa.cu
NVCC version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
During compilation, the RAM usage does not exceed 8~10GB, and the computer has 128GB of RAM.
Any help or directions would be greatly appreciated.
Related
I'm trying to write a very simple example with OpenMP using GCC 12 as compiler and the target clause in order to detect all my devices, but it does not work. First of all, I have a Lenovo W540 running Debian Sid with an integrated Inted HD Graphics 4600 (which is the GPU the GUI -plasma- is using). Also, there is a NVIDIA GK106GLM [Quadro K2100M]. I have installed the xserver-xorg-video-nvidia-legacy-390xx, nvidia-legacy-390xx-driver, and nvidia-legacy-390xx-kernel-dkms packages, as they are the corresponding drivers for my GPU. Running the lspci -k | grep -A 2 -i "VGA" command I obtain
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
Subsystem: Lenovo 4th Gen Core Processor Integrated Graphics Controller
Kernel driver in use: i915
--
01:00.0 VGA compatible controller: NVIDIA Corporation GK106GLM [Quadro K2100M] (rev a1)
Subsystem: Lenovo GK106GLM [Quadro K2100M]
Kernel modules: nvidia
so I suppose that the NVDIA GPU is well configured and detected. Am I right?
Then, I borrowed this simply code from https://enccs.github.io/openmp-gpu/target/
/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
int main()
{
int num_devices = omp_get_num_devices();
printf("Number of available devices %d\n", num_devices);
#pragma omp target
{
if (omp_is_initial_device()) {
printf("Running on host\n");
} else {
int nteams= omp_get_num_teams();
int nthreads= omp_get_num_threads();
printf("Running on device with %d teams in total and %d threads in each team\n",nteams,nthreads);
}
}
}
Compilation with GCC as gcc -Wall -fopenmp example.c -o example works fine and does not produce warnings nor errors, but at execution time I obtain simply
Number of available devices 0
Running on host
so not Inter nor NVIDIA GPUs were detected. Inspecting the Debian repositories I've seen and installed the packages gcc-offload-nvptx, gcc-12-offload-nvptx, libgomp-plugin-nvptx1, and nvptx-tools. Now, when I compile again as gcc -Wall -fopenmp example.c -o example I obtain a warning:
/usr/bin/ld: /tmp/ccrWP8gn.crtoffloadtable.o: warning: relocation against `__offload_vars_end' in read-only section `.rodata'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
When I execute the code I obtain again
Number of available devices 0
Running on host
but in this case the execution is terrubly slow (various seconds of time).
Searching the web I've seen that I must add the option -foffload=nvptx-none to the compilation order, but I obtain the same results as previously (this option only is recognized if gcc-offload-nvptx et al. are installed.
Running gcc -v I can see that GCC 12 in Debian is configured for offloading:
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
I don't know which number is my NVIDIA card, but I've tried export OFFLOAD_TARGET_DEFAULT=0 and export OFFLOAD_TARGET_DEFAULT=2 with the same wrong result of
Number of available devices 0
Running on host
So, how can I run my OpenMP code in the GPU?
Thanks
I have a sample code with memory leaks. Though clang displays the leak correctly, I am unable to achieve the same with gcc. gcc version I am using is 4.8.5-39
Code:
#include <stdlib.h>
void *p;
int main() {
p = malloc(7);
p = 0; // The memory is leaked here.
return 0;
}
CLANG:
clang -fsanitize=address -g memory-leak.c ; ASAN_OPTIONS=detect_leaks=1 ./a.out
=================================================================
==15543==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 7 byte(s) in 1 object(s) allocated from:
#0 0x465289 in __interceptor_malloc (/u/optest/a.out+0x465289)
#1 0x47b549 in main /u/optest/memory-leak.c:4
#2 0x7f773fe14544 in __libc_start_main /usr/src/debug/glibc-2.17-c758a686/csu/../csu/libc-start.c:266
SUMMARY: AddressSanitizer: 7 byte(s) leaked in 1 allocation(s).
GCC:
gcc -fsanitize=address -g memory-leak.c ; ASAN_OPTIONS=detect_leaks=1 ./a.out
I need to use gcc. Can someone help me understand why gcc isn't behaving like clang and what should I do to make it work.
Can someone help me understand why gcc isn't behaving like clang
Gcc-4.8 was released on March 22, 2013.
The patch to support -fsanitize=leak was sent on Nov. 15, 2013, and is likely not backported into the 4.8 release.
GCC-8.3 has no trouble detecting this leak:
$ gcc -g -fsanitize=address memory-leak.c && ./a.out
=================================================================
==166614==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 7 byte(s) in 1 object(s) allocated from:
#0 0x7fcaf3dc5330 in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xe9330)
#1 0x55d297afc162 in main /tmp/memory-leak.c:4
#2 0x7fcaf394052a in __libc_start_main ../csu/libc-start.c:308
SUMMARY: AddressSanitizer: 7 byte(s) leaked in 1 allocation(s).
This item in Asan's FAQ is relevant:
Q: Why didn't ASan report an obviously invalid
memory access in my code?
A1: If your errors is too obvious, compiler might have
already optimized it out by the time Asan runs.
When I compile the following code containing the design C++11, in Windows7x64 (MSVS2012 + Nsight 2.0 + CUDA5.5), then I do not get errors, and everything compiles and works well:
#include <thrust/device_vector.h>
int main() {
thrust::device_vector<int> dv(10);
auto iter = dv.begin();
return 0;
}
But when I try to compile it under the Linux64 (Debian 7 Wheezey + Nsight Eclipse from CUDA5.5), I get errors:
../src/CudaCpp11.cu(5): error: explicit type is missing ("int"
assumed)
../src/CudaCpp11.cu(5): error: no suitable conversion function from
"thrust::detail::normal_iterator>" to "int"
exists
2 errors detected in the compilation of
"/tmp/tmpxft_00001520_00000000-6_CudaCpp11.cpp1.ii". make: *
[src/CudaCpp11.o] Error 2
When I added line:-stdc++11
in Properties-> Build-> Settings-> Tool Settings-> Build Stages-> Preprocessor options (-Xcompiler)
I get more errors:
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/stddef.h(432): error:
identifier "nullptr" is undefined
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/stddef.h(432): error:
expected a ";"
...
/usr/include/c++/4.8/bits/cpp_type_traits.h(314): error: namespace
"std::__gnu_cxx" has no member
"__normal_iterator"
/usr/include/c++/4.8/bits/cpp_type_traits.h(314): error: expected a
">"
nvcc error : 'cudafe' died due to signal 11 (Invalid memory
reference) make: * [src/CudaCpp11.o] Error 11
Only when I use thrust::device_vector<int>::iterator iter = dv.begin(); in Linux-GCC then I do not get an error. But in Windows MSVS2012 all c++11 features works fine!
Can I use C++11 in the .cu-files (CUDA5.5) in Windows7x64 (MSVC) and Linux64 (GCC4.8.2)?
You will probably have to split the main.cpp from your others.cu like this:
others.hpp:
void others();
others.cu:
#include "others.hpp"
#include <boost/typeof/std/utility.hpp>
#include <thrust/device_vector.h>
void others() {
thrust::device_vector<int> dv(10);
BOOST_AUTO(iter, dv.begin()); // regular C++
}
main.cpp:
#include "others.hpp"
int main() {
others();
return 0;
}
This particular answer shows that compiling with an officially supported gcc version (as Robert Crovella stated correctly) should work out at least for c++11 code in the main.cpp file:
g++ -std=c++0x -c main.cpp
nvcc -arch=sm_20 -c others.cu
nvcc -lcudart -o test main.o others.o
(tested on Debian 8 with nvcc 5.5 and gcc 4.7.3).
To answer your underlying question: I am not aware that one can use C++11 in .cu files with CUDA 5.5 in Linux (and I was not aware the shown example with host-side C++11 gets properly de-cluttered under MSVC). I even filed a feature request for constexpr support which is still open.
The CUDA programming guide for CUDA 5.5 states:
For the host code, nvcc supports whatever part of the C++ ISO/IEC
14882:2003 specification the host c++ compiler supports.
For the device code, nvcc supports the features illustrated in Code
Samples with some restrictions described in Restrictions; it does not
support run time type information (RTTI), exception handling, and the
C++ Standard Library.
Anyway, it is possible to use some of the C++11 features like auto in kernels, e.g. with boost::auto.
As an outlook, other C++11 features like threads may be quite unlikely to end up in CUDA and I heard no official plans about them yet (as of supercomputing 2013).
Shameless plug: If you are interested in more of these tweeks, feel free to have a look in our library libPMacc which provides multi-GPU grid and particle abstractions for simulations. We implemented lambda, a STL-like access concept for 1-3D matrices and other useful stuff there.
All the best,
Axel
Update: Since CUDA 7.0 C++11 support in kernels has been added officially. As BenC pointed our correctly, parts of this feature were already silently added in CUDA 6.5.
According to Jared Hoberock (Thrust developer), it seems that C++11 support has been added to CUDA 6.5 (although it is still experimental and undocumented). This may make things easier when starting to use C++11 in very large C++/CUDA projects, since splitting everything can be quite cumbersome for large projects when you use CMake for instance.
In one application, I've got a bunch of CUDA kernels. Some use dynamic parallelism and some don't. For the purposes of either providing a fallback option if this is not supported, or simply allowing the application to continue but with reduced/partially available features, how can I go about compiling?
At the moment I'm getting invalid device function when running kernels compiled with -arch=sm_35 on a 670 (max sm_30) that don't require compute 3.5.
AFAIK you can't use multiple -arch=sm_* arguments and using multiple -gencode=* doesn't help. Also for separable compilation I've had to create an additional object file using -dlink, but this doesn't get created when using compute 3.0 (nvlink fatal : no candidate found in fatbinary due to -lcudadevrt, which I've needed for 3.5), how should I deal with this?
I believe this issue has been addressed now in CUDA 6.
Here's my simple test:
$ cat t264.cu
#include <stdio.h>
__global__ void kernel1(){
printf("Hello from DP Kernel\n");
}
__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
kernel1<<<1,1>>>();
#else
printf("Hello from non-DP Kernel\n");
#endif
}
int main(){
kernel2<<<1,1>>>();
cudaDeviceSynchronize();
return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
I don't believe there is a way to do this using the runtime API as of CUDA 5.5.
The only way I can think of to get around the problem is to use the driver API to perform your own architecture selection and load code from different cubin files at runtime. The APIs can be safely mixed, so it is only the context establishment-device selection-module load phase which needs to be done with the driver API. You can use the runtime API after that - you will need a little bit of homemade syntactic sugar for the kernel launches, but otherwise no code changes are required in other runtime API code.
Recently I'm trying to modify GCC and gcov to collect execution sequence for program. As we all know, gcc will instrument code on arcs between basic blocks to count the execution count of arc. So I instrument a function on the arc, and the function will print out the no of that arc, so I can collect program execution sequence. It works well for c program on x86 and x86_64, also for c++ program of x86. But for c++ program on x86_64, the program will crash by segment error. The compilation has no problem. The os that I use is CentOS 6.4. Version of gcc is 3.4.5. Does anybody has some advice?
sample program:
#include <iostream> using namespace std; int main(){cout<<"hello world"<<endl;}
If I compile the program in x86_64 mode. The program crash by Segment Error when comes to the cout CALL.
Ok, by another night debug on it. I found that the function emit_library_call will only generate asm code to invoke my function, but not protect the context (registers). So function call before or after the emitted code may fail due to nonuniform context. And x86_64 asm use different registers with x86. So to work well on x86 platform may be just accident. I need a function api which can emit library function call and also protect the context. Maybe I should write another emit_library_call.
Perhaps you might try a dynamic binary translation framework, e.g. DynamoRIO or Pin. These tools offer more flexibility than you need, but they would allow you do inject code at the beginning/end of each basic block. What you then want to do is save/restore the flags and registers (and potentially re-align the stack), and call out to a function. DynamoRIO has similar functionality built in, named a "clean call". I think Pin also enables this with a potentially higher-level interface.
I did same thing what you did in 3.5.0-23-generic #35~precise1-Ubuntu SMP Fri Jan 25 17:13:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
#include <iostream>
`using namespace std;
int main()
{
cout<<"hello world"<<endl;
}`
compiled above code with g++ -ftest-coverage -fprofile-arcs hello.cpp -o hello
hello.gcno file is generated.
After executing ./hello hello.gcda file generated .
So once check your gcc version .
My gcc version is gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)