Is my OpenACC code compiled with GCC running on the GPU?

I'm trying to write a very simple example with OpenACC using GCC as the compiler, but I'm not sure whether the program runs on the GPU. First of all, I have a Lenovo W540 running Debian Sid with an integrated Intel HD Graphics 4600 (which is the GPU the GUI, Plasma, is using). There is also an NVIDIA GK106GLM [Quadro K2100M]. I have installed the xserver-xorg-video-nvidia-legacy-390xx, nvidia-legacy-390xx-driver, and nvidia-legacy-390xx-kernel-dkms packages, as they are the corresponding drivers for my GPU. Running the lspci -k | grep -A 2 -i "VGA" command I obtain
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
Subsystem: Lenovo 4th Gen Core Processor Integrated Graphics Controller
Kernel driver in use: i915
--
01:00.0 VGA compatible controller: NVIDIA Corporation GK106GLM [Quadro K2100M] (rev a1)
Subsystem: Lenovo GK106GLM [Quadro K2100M]
Kernel modules: nvidia
so I suppose that the NVIDIA GPU is correctly configured and detected. Am I right?
On the other hand, I've installed GCC 12 and also the gcc-offload-nvptx package. First question: is that package necessary in order to run programs on the GPU?
Well, I wrote this very simple code
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

#define N 10

int main()
{
    int i = 0, buf[N];

#ifdef _OPENACC
    printf("Number of devices: %d\n", acc_get_num_devices(acc_device_nvidia));
#endif

#pragma acc parallel loop copyout(buf[0:N])
    for (i = 0; i < N; i++)
    {
        buf[i] = i;
    }

    for (i = 0; i < N; i++)
    {
        printf("%d ", buf[i]);
    }
    printf("\n");

    return 0;
}
If I compile it as gcc -Wall -fopenacc example.c -o example, I obtain warnings:
/usr/bin/ld: /tmp/cckzNCnl.crtoffloadtable.o: warning: relocation against `__offload_vars_end' in read-only section `.rodata'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
but I can't interpret them. If I run the code, it works, but it is very, very slow, and I obtain
Number of devices: 0
0 1 2 3 4 5 6 7 8 9
The result is correct, and the extreme slowness might suggest it ran on the GPU (maybe due to the data copies?), yet the reported number of devices is 0.
Can someone give me a hint about how to compile and run the code on the GPU?
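For reference, one way to check where a compute region actually runs is to query the OpenACC runtime from inside it. The following is a minimal sketch, assuming a working nvptx offload toolchain and a compile line along the lines of gcc -Wall -fopenacc -foffload=nvptx-none check.c -o check (check.c is just a placeholder name):
#include <stdio.h>
#include <openacc.h>

int main(void)
{
    int on_device = 0;

    /* acc_on_device() returns non-zero when the enclosing code was
       compiled for and is executing on an accelerator device. */
    #pragma acc parallel copyout(on_device)
    {
        on_device = acc_on_device(acc_device_not_host);
    }

    printf("Compute region ran on: %s\n", on_device ? "accelerator" : "host");
    return 0;
}
If this prints "host" even with the offload packages installed, the nvptx offload compiler is most likely not being invoked at all.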

Related

Simple OpenMP program for detecting GPU and other devices does not work

I'm trying to write a very simple example with OpenMP using GCC 12 as the compiler and the target directive in order to detect all my devices, but it does not work. First of all, I have a Lenovo W540 running Debian Sid with an integrated Intel HD Graphics 4600 (which is the GPU the GUI, Plasma, is using). There is also an NVIDIA GK106GLM [Quadro K2100M]. I have installed the xserver-xorg-video-nvidia-legacy-390xx, nvidia-legacy-390xx-driver, and nvidia-legacy-390xx-kernel-dkms packages, as they are the corresponding drivers for my GPU. Running the lspci -k | grep -A 2 -i "VGA" command I obtain
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
Subsystem: Lenovo 4th Gen Core Processor Integrated Graphics Controller
Kernel driver in use: i915
--
01:00.0 VGA compatible controller: NVIDIA Corporation GK106GLM [Quadro K2100M] (rev a1)
Subsystem: Lenovo GK106GLM [Quadro K2100M]
Kernel modules: nvidia
so I suppose that the NVIDIA GPU is correctly configured and detected. Am I right?
Then, I borrowed this simple code from https://enccs.github.io/openmp-gpu/target/
/* Copyright (c) 2019 CSC Training */
/* Copyright (c) 2021 ENCCS */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
    int num_devices = omp_get_num_devices();
    printf("Number of available devices %d\n", num_devices);

#pragma omp target
    {
        if (omp_is_initial_device()) {
            printf("Running on host\n");
        } else {
            int nteams = omp_get_num_teams();
            int nthreads = omp_get_num_threads();
            printf("Running on device with %d teams in total and %d threads in each team\n",
                   nteams, nthreads);
        }
    }
}
Compiling with GCC as gcc -Wall -fopenmp example.c -o example works fine and produces neither warnings nor errors, but at execution time I simply obtain
Number of available devices 0
Running on host
so neither the Intel nor the NVIDIA GPU was detected. Inspecting the Debian repositories, I found and installed the packages gcc-offload-nvptx, gcc-12-offload-nvptx, libgomp-plugin-nvptx1, and nvptx-tools. Now, when I compile again as gcc -Wall -fopenmp example.c -o example, I obtain warnings:
/usr/bin/ld: /tmp/ccrWP8gn.crtoffloadtable.o: warning: relocation against `__offload_vars_end' in read-only section `.rodata'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
When I execute the code I obtain again
Number of available devices 0
Running on host
but in this case the execution is terribly slow (several seconds).
Searching the web, I've seen that I must add the option -foffload=nvptx-none to the compilation command, but I obtain the same results as before (this option is only recognized if gcc-offload-nvptx et al. are installed).
Running gcc -v I can see that GCC 12 in Debian is configured for offloading:
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
I don't know which number corresponds to my NVIDIA card, but I've tried export OFFLOAD_TARGET_DEFAULT=0 and export OFFLOAD_TARGET_DEFAULT=2, with the same wrong result of
Number of available devices 0
Running on host
So, how can I run my OpenMP code on the GPU?
Thanks
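As an aside, when chasing this kind of silent fallback to the host, it can help to make offloading mandatory so that a missing device fails loudly. The sketch below is a stripped-down variant of the code above; OMP_TARGET_OFFLOAD is the standard OpenMP 5.0 environment variable for this and is honored by recent libgomp (the file name offload_check.c is only illustrative):
/* offload_check.c -- report where the target region actually ran.
   Run as, e.g.:  OMP_TARGET_OFFLOAD=MANDATORY ./offload_check
   With MANDATORY, libgomp aborts if no usable device is found,
   instead of silently running the region on the host. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int on_host = 1;

    /* map the flag back so the host can inspect it afterwards */
    #pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();
    }

    printf("Devices visible to libgomp: %d\n", omp_get_num_devices());
    printf("Target region ran on: %s\n", on_host ? "host (no offload)" : "device");
    return 0;
}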

gcc 11.3.0 on macOS/Apple Silicon and SHA-3 instructions

I have gcc 11.3.0 installed using Homebrew on a MacBook Air with Apple Silicon M1 CPU. The binary is the aarch64 native version, not Rosetta emulated. The installed OS is macOS Monterey 12.3.
I'm having an issue compiling a program which uses the ARMv8.2-A SHA-3 extension instructions, which are supported by the M1 CPU. This is a minimal reproducible example:
#include <arm_neon.h>

int main() {
    uint64x2_t a = {0}, b = {0}, c = {0};
    veor3q_u64(a, b, c);
    return 0;
}
This code compiles just fine with the Apple supplied clang compiler.
I compiled it using the following command line for gcc 11:
gcc-11 -o test test.c -march=armv8-a+sha3
This results in the following error:
In file included from test.c:1:
test.c: In function 'main':
/opt/homebrew/Cellar/gcc/11.3.0/lib/gcc/11/gcc/aarch64-apple-darwin21/11/include/arm_neon.h:32320:1: error: inlining failed in call to 'always_inline' 'veor3q_u64': target specific option mismatch
32320 | veor3q_u64 (uint64x2_t __a, uint64x2_t __b, uint64x2_t __c)
| ^~~~~~~~~~
test.c:5:5: note: called from here
5 | veor3q_u64(a, b, c);
| ^~~~~~~~~~~~~~~~~~~
Is this a bug in this particular hardware/software combination, or is there some command-line option I can pass to gcc to make this particular program compile?
Solved the problem. It turns out that gcc requires -march=armv8.2-a+sha3 rather than just -march=armv8-a+sha3 to compile this intrinsic. Indeed, in gcc's version of arm_neon.h, one can find this right before the block of intrinsics which includes veor3q_u64:
#pragma GCC target ("arch=armv8.2-a+sha3")
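If changing the global -march flag is not desirable, the same pragma that arm_neon.h uses can in principle be applied around the calling code instead; a minimal sketch of that approach (not verified on this exact Homebrew build) is:
#include <arm_neon.h>

/* Enable the SHA-3 extension only for the functions in this region,
   mirroring the pragma GCC itself uses inside arm_neon.h. */
#pragma GCC push_options
#pragma GCC target ("arch=armv8.2-a+sha3")

static uint64x2_t xor3(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
    return veor3q_u64(a, b, c);
}

#pragma GCC pop_options

int main(void)
{
    uint64x2_t a = {0}, b = {0}, c = {0};
    xor3(a, b, c);
    return 0;
}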

glibc set __TIMESIZE

I'm trying to port my 32-bit ARM architecture to 64-bit time values.
Reading the answers from 64-bit time_t in Linux Kernel, I see the following:
All user space must be compiled with a 64-bit time_t, which will be supported in the coming musl-1.2 and glibc-2.32 releases, along with installed kernel headers from linux-5.6 or higher.
I'm using a custom Linux kernel version 5.10.10 and built my own gcc toolchain with crosstool-ng using glibc version 2.32.
To test the time sizes, I wrote a simple print:
printf("Timesize = %d\n", __TIMESIZE);
printf("sizeof time_t = %d\n", sizeof(tv.tv_sec));
Which gives me the output:
Timesize = 32
sizeof time_t = 4
Following the defines and typedefs in glibc-2.32, I can see the relevant types defined as follows:
#define __TIMESIZE __WORDSIZE
#define __TIME_T_TYPE __SLONGWORD_TYPE
#define __SLONGWORD_TYPE long int
But __WORDSIZE and __SLONGWORD_TYPE are both 32 bit in size for my architecture.
Is it possible to have time_t defined as 64 bit on my 32 bit architecture target?
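For context, the knob glibc eventually exposed for this is the _TIME_BITS feature-test macro, which must be combined with _FILE_OFFSET_BITS=64; as far as I know it only became available in glibc 2.34, so it may not be usable with 2.32. A minimal sketch of the check, assuming such a glibc and an illustrative cross-compiler name:
/* Build with, e.g.:
     arm-linux-gnueabihf-gcc -D_FILE_OFFSET_BITS=64 -D_TIME_BITS=64 check_time.c
   On a 32-bit target with a new-enough glibc this should report an
   8-byte time_t; without the macros it reports 4 bytes. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    printf("sizeof(time_t) = %zu\n", sizeof(time_t));
    return 0;
}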

NVCC Compiler Crashes after hours of compilation

For some reason, NVCC is crashing when trying to compile a GPU program with very long double-precision arithmetic expressions, of the form
// given double-precision arrays A[ ], F[ ]
__global__ void myKernel(double *A, double *F, long n){
    // ... thread ids
    A[t+1] = A[t]*F[t+1] + ...... (on the order of a million terms) ...... + A[t-1]*F[t];
}
The same code compiles successfully with GCC (around 30 minutes of compilation) and even executes correctly.
The crash occurs after more than 20 hours of compilation time, with the following error:
nvcc error : 'cicc' died due to signal 11 (Invalid memory reference)
nvcc error : 'cicc' core dumped
make: *** [Makefile:28: obj/EquationAlfa.cu.o] Error 139
As a side note, if we change the GPU program to use 32-bit floats, it does compile correctly, although it still takes hours to compile.
The compilation line is this:
nvcc -std=c++14 -c -o obj/EquationAlfa.cu.o src/EquationAlfa.cu
NVCC version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
During compilation, the RAM usage does not exceed 8~10GB, and the computer has 128GB of RAM.
Any help or directions would be greatly appreciated.

Compiling CUDA with dynamic parallelism fallback - multiple architectures/compute capability

In one application, I've got a bunch of CUDA kernels. Some use dynamic parallelism and some don't. For the purposes of either providing a fallback option if this is not supported, or simply allowing the application to continue but with reduced/partially available features, how can I go about compiling?
At the moment I'm getting invalid device function when running kernels compiled with -arch=sm_35 on a GTX 670 (max sm_30), even for kernels that don't require compute 3.5.
AFAIK you can't use multiple -arch=sm_* arguments, and using multiple -gencode=* options doesn't help. Also, for separable compilation I've had to create an additional object file using -dlink, but this doesn't get created when using compute 3.0 (nvlink fatal : no candidate found in fatbinary, due to -lcudadevrt, which I've needed for 3.5). How should I deal with this?
I believe this issue has been addressed now in CUDA 6.
Here's my simple test:
$ cat t264.cu
#include <stdio.h>

__global__ void kernel1(){
    printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
    kernel1<<<1,1>>>();
#else
    printf("Hello from non-DP Kernel\n");
#endif
}

int main(){
    kernel2<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
I don't believe there is a way to do this using the runtime API as of CUDA 5.5.
The only way I can think of to get around the problem is to use the driver API to perform your own architecture selection and load code from different cubin files at runtime. The two APIs can be safely mixed, so only the context establishment, device selection, and module-loading phase needs to be done with the driver API. You can use the runtime API after that; you will need a little bit of homemade syntactic sugar for the kernel launches, but otherwise no changes are required in other runtime API code.
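To make that idea concrete, here is a minimal sketch of the driver-API selection phase. The cubin file names and the kernel name are hypothetical placeholders (and the kernel would need to be declared extern "C", or looked up by its mangled name); error handling is reduced to a bare check:
#include <stdio.h>
#include <cuda.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    fprintf(stderr, "driver API error %d at line %d\n", (int)r, __LINE__); \
    return 1; } } while (0)

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    int major = 0, minor = 0;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev));
    CHECK(cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* Pick a cubin built for the device's compute capability.
       The file names are placeholders for cubins built with
       -arch=sm_35 and -arch=sm_30 respectively. */
    const char *cubin = (major > 3 || (major == 3 && minor >= 5))
                            ? "kernel_sm35.cubin"
                            : "kernel_sm30.cubin";
    CHECK(cuModuleLoad(&mod, cubin));
    CHECK(cuModuleGetFunction(&fn, mod, "kernel2"));

    /* Launch <<<1,1>>> with no kernel arguments. */
    CHECK(cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL));
    CHECK(cuCtxSynchronize());

    CHECK(cuModuleUnload(mod));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
After the module is loaded, ordinary runtime API calls (cudaMalloc, cudaMemcpy, and so on) can be used in the same context, as noted above.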
