Briefly speaking, my question is about compiling/building files (using libraries) with two different compilers while exploiting OpenACC constructs in the source files.
I have a C source file that contains an OpenACC construct. It holds only a simple function that computes the total sum of an array:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
double calculate_sum(int n, double *a) {
    double sum = 0;
    int i;
    printf("Num devices: %d\n", acc_get_num_devices(acc_device_nvidia));
    #pragma acc parallel copyin(a[0:n])
    #pragma acc loop
    for(i=0;i<n;i++) {
        sum += a[i];
    }
    return sum;
}
I can easily compile it with the following line:
pgcc -acc -ta=nvidia -c libmyacc.c
Then, I create a static library with the following line:
ar -cvq libmyacc.a libmyacc.o
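As a quick sanity check (these verification commands are my addition, not part of the original steps), you can list the archive contents and confirm the symbol is there:
ar -t libmyacc.a
nm libmyacc.a | grep calculate_sum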
To use my library, I wrote a piece of code as follows:
#include <stdio.h>
#include <stdlib.h>
#define N 1000
extern double calculate_sum(int n, double *a);
int main() {
    printf("Hello --- Start of the main.\n");
    double *a = (double*) malloc(sizeof(double) * N);
    int i;
    for(i=0;i<N;i++) {
        a[i] = (i+1) * 1.0;
    }
    double sum = 0.0;
    for(i=0;i<N;i++) {
        sum += a[i];
    }
    printf("Sum: %.3f\n", sum);
    double sum2 = -1;
    sum2 = calculate_sum(N, a);
    printf("Sum2: %.3f\n", sum2);
    return 0;
}
Now, I can use this static library with the PGI compiler itself to compile the above source (f1.c):
pgcc -acc -ta=nvidia f1.c libmyacc.a
And it executes flawlessly. However, things are different with gcc, and this is where my question lies: how can I build it properly with gcc?
Thanks to Jeff's comment on this question:
linking pgi compiled library with gcc linker, I can now build my source file (f1.c) without errors, but the resulting executable emits a fatal error at run time.
This is what I use to compile my source file with gcc (f1.c):
gcc f1.c -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lpgmp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
This is the error:
Num devices: 2
Accelerator Fatal Error: No CUDA device code available
Thanks to the -v option when compiling f1.c with the PGI compiler, I can see that the compiler invokes many other tools from PGI and NVIDIA (like pgacclnk and nvlink).
My questions:
Am I on the wrong path? Can I call functions in PGI-compiled libraries from GCC and use OpenACC within those functions?
If the answer to the above is positive, can I still link without the extra steps (calling pgacclnk and nvlink) that PGI takes?
If the answer to that is positive too, what should I do?
Add "-ta=tesla:nordc" to your pgcc compilation. By default PGI uses runtime dynamic compilation (RDC) for the GPU code. However RDC requires an extra link step (with nvlink) that gcc does not support. The "nordc" sub-option disables RDC so you'll be able to use OpenACC code in a library. However by disabling RDC you can no longer call external device routines from a compute region.
% pgcc -acc -ta=tesla:nordc -c libmyacc.c
% ar -cvq libmyacc.a libmyacc.o
a - libmyacc.o
% gcc f1.c -L/proj/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lpgmp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
% a.out
Hello --- Start of the main.
Sum: 500500.000
Num devices: 8
Sum2: 500500.000
Hope this helps,
Mat
Related
Say I have a function with the inline keyword in a compilation unit.
If I have
// math.h
inline int sum(int x, int y);
and
// math.c
inline int sum(int x, int y)
{
    return x + y;
}
and
// main.c
#include "math.h"
int main(int argc, char **argv)
{
    return sum(argc, argc);
}
And building with
gcc -O3 -c math.c -o math.o
gcc -O3 -c main.c -o main.o
gcc math.o main.o
Will an optimizing compiler inline sum? Can gcc or clang inline functions from other compilation units?
GCC can (and often will) inline functions from different TUs when you compile with LTO enabled. For this you need to add -flto to CFLAGS/CXXFLAGS and LDFLAGS.
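Applied to the math.c/main.c example above (these exact command lines are mine), that would look roughly like this; note that -flto has to be passed to both the compile steps and the link step:
gcc -O3 -flto -c math.c -o math.o
gcc -O3 -flto -c main.c -o main.o
gcc -O3 -flto math.o main.o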
Previously, I asked a question regarding the creation of a static library with PGI and linking it to a program that is built with gcc: c - Linking a PGI OpenACC-enabled library with gcc
Now, I have the same question, but for the dynamic case. How can I build a program with gcc while my library is built dynamically with PGI?
Also, consider the following facts:
I want both of them to recognize the same OpenMP pragmas and routines. For instance, when I use an OpenMP critical region in the library, the whole program should be serialized at that section.
OpenACC pragmas are used in the library that was built with PGI.
Load library completely dynamic in my application. I mean using dlopen to open lib and dlsym to find functions.
I also want my threads to be able to simultaneously access the GPU for data transfer and/or computations. For more details, see the following code snippets.
For instance, building the following lib and main code emits this error: call to cuMemcpyHtoDAsync returned error 1: Invalid value
Note: When building the following codes, I intentionally used libgomp (-lgomp) instead of PGI's OpenMP library (-lpgmp) in both cases, the lib and the main.
Lib code:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
#include <omp.h>
double calculate_sum(int n, double *a) {
    double sum = 0;
    int i;
    #pragma omp critical
    {
        printf("Num devices: %d\n", acc_get_num_devices(acc_device_nvidia));
        #pragma acc enter data copyin(a[0:n])
        #pragma acc parallel
        #pragma acc loop
        for(i=0;i<n;i++) {
            sum += a[i];
        }
        #pragma acc exit data delete(a[0:n])
    }
    return sum;
}

int ret_num_dev(int index) {
    int dev = acc_get_num_devices(acc_device_nvidia);
    if(dev == acc_device_nvidia)
        printf("Num devices: %d - Current device: %d\n", dev, acc_get_device());
    return dev;
}
I built the library with the following commands:
pgcc -acc -ta=nvidia:nordc -fPIC -c libmyacc.c
pgcc -shared -Wl,-soname,libctest.so.1 -o libmyacc.so -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc libmyacc.o
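As a sanity check (again my addition, not part of the original question), you can verify that the symbol you will later look up with dlsym is actually exported from the shared object:
nm -D libmyacc.so | grep calculate_sum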
Main code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <dlfcn.h>
#define N 1000
// to make sure library is loaded just once for whole program
static void *lib_handle = NULL;
static int lib_loaded = 0;
static double (*calculate_sum2)(int , double *);
void call_lib_so() {
    // load library just once and init the function pointer
    // to function in the library.
    if(lib_loaded == 0) {
        lib_loaded = 1;
        char *error;
        lib_handle = dlopen("/home/millad/temp/gcc-pgi/libmyacc.so", RTLD_NOW);
        if (!lib_handle) {
            fprintf(stderr, "%s\n", dlerror());
            exit(1);
        }
        calculate_sum2 = (double (*)(int , double *)) dlsym(lib_handle, "calculate_sum");
        if ((error = dlerror()) != NULL) {
            fprintf(stderr, "%s\n", error);
            exit(1);
        }
    }
    // execute the function per call
    int n = N, i;
    double *a = (double *) malloc(sizeof(double) * n);
    for(i=0;i<n;i++)
        a[i] = 1.0 * i;
    double sum = (*calculate_sum2)(n, a);
    free(a);
    printf("-------- SUM: %.3f\n", sum);
    // dlclose(lib_handle);
}
extern double calculate_sum(int n, double *a);
int main() {
    // allocation and initialization of an array
    double *a = (double*) malloc(sizeof(double) * N);
    int i;
    for(i=0;i<N;i++) {
        a[i] = (i+1) * 1.0;
    }
    // access and run OpenACC region with all threads
    #pragma omp parallel
    call_lib_so();
    return 0;
}
And I built my main code with the following command, using gcc as described by Mat in my previous question:
gcc f1.c -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
Am I doing something wrong? Are above steps correct?
Your code works correctly for me. I tried to use what you listed but needed to remove the "-Wl,-soname,libctest.so.1" option, change the location where dlopen loads the .so from, and add "-DN=1024" to the gcc compile line. After that, it compiled and ran fine.
% pgcc -acc -ta=nvidia:nordc -fPIC -c libmyacc.c -V16.5
% pgcc -shared -o libmyacc.so -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc libmyacc.o -V16.5
% gcc f1.c -L/proj/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc -DN=1024
% ./a.out
Num devices: 8
-------- SUM: 523776.000
As a university project I wrote a rudimentary benchmark for various sorting algorithms. I have to use several compiler flags such as -Wall -g0 -O3 and the C++14 standard.
At some point I realized that a few time measurements didn't make sense and that algorithm optimizations had no effect. A fellow student ran the same code on a Linux machine with GCC and it worked as expected. Hence, it must have been (and still is) an LLVM/clang++ configuration issue – at least that's my assumption.
So I checked the build settings within my IDE (Xcode 7.1.1) and even compiled the sources on the shell, but the problem remains.
Finally, I had a look at Xcode's report navigator and used the verbose output (-v) on the terminal. It revealed that clang++ is using many other parameters by default:
"/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang" -cc1 -triple x86_64-apple-macosx10.11.0 -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -emit-obj -disable-free -disable-llvm-verifier -main-file-name Benchmark.cpp -mrelocation-model pic -pic-level 2 -mthread-model posix -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu core2 -target-linker-version 253.6 -v -dwarf-column-info -resource-dir /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/7.0.0 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk -stdlib=libc++ -O3 -Wall -std=c++14 -fdeprecated-macro -fdebug-compilation-dir /Users/Kruse/Downloads/ProgramOptimization -ferror-limit 19 -fmessage-length 142 -stack-protector 1 -mstackrealign -fblocks -fobjc-runtime=macosx-10.11.0 -fencode-extended-block-signature -fcxx-exceptions -fexceptions -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics -vectorize-loops -vectorize-slp -o /var/folders/q4/rj3wqzms2fdcqlxgdzb3l5hc0000gn/T/Benchmark-1912ed.o -x c++ Benchmark.cpp
How can I strip this down to the crucial parts? Or am I on the wrong track?
Here's the code under test (don't blame me for the bubblesort-like swapping – it's a requirement):
class InsertionSort {
public:
    template <typename T, size_t SIZE>
    static void sort(std::array<T, SIZE> &field) {
        for (size_t global = 1; global < SIZE; global++) {
            for (size_t sorted = global; sorted > 0 && field[sorted] < field[sorted - 1]; sorted--) {
                std::swap(field[sorted], field[sorted - 1]);
            }
        }
    }

    template <typename T, size_t SIZE>
    static void sortGuard(std::array<T, SIZE> &field) {
        size_t minIndex = MinimumSearch::getMin(field);
        std::swap(field[minIndex], field[0]);
        for (size_t global = 2; global < SIZE; global++) {
            for (size_t sorted = global; field[sorted] < field[sorted - 1]; sorted--) {
                std::swap(field[sorted], field[sorted - 1]);
            }
        }
    }
};
Minimum search is implemented as follows:
class MinimumSearch {
public:
    template <typename T, size_t SIZE>
    static size_t getMin(std::array<T, SIZE> &field, size_t startIndex = 0) {
        size_t minIndex = startIndex;
        T minVal = field[minIndex];
        for (size_t i = minIndex; i < SIZE; i++) {
            if (field[i] < minVal) {
                minIndex = i;
                minVal = field[i];
            }
        }
        return minIndex;
    }
};
InsertionSort::sortGuard(std::array<T, SIZE>&) should be faster than the default sort method. That's the case with GCC, but not with LLVM/clang++.
If you don't have command line tools installed, you can install them with the terminal command:
xcode-select --install
Once installed, you can run clang from the command line with something like:
clang++ -std=c++14 -O3 test.cpp
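If you then want to compare the flags of this plain invocation against the Xcode one, clang's -### option prints the underlying -cc1 job without running it (this comparison step is my suggestion, not part of the original answer):
clang++ -### -std=c++14 -O3 test.cpp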
I'm trying, so far without success, to convince gccgo to vectorize the following snippet:
package foo
func Sum(v []float32) float32 {
    var sum float32 = 0
    for _, x := range v {
        sum += x
    }
    return sum
}
I'm verifying the assembly generated by:
$ gccgo -O3 -ffast-math -march=native -S test.go
gccgo version is:
$ gccgo --version
gccgo (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157]
Isn't gccgo supposed to be able to vectorize this code? The equivalent C code
with the same gcc options is perfectly vectorized with AVX instructions...
UPDATE
Here is the corresponding C example:
#include <stdlib.h>
float sum(float *v, size_t n) {
    size_t i;
    float sum = 0;
    for(i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum;
}
compile with:
$ gcc -O3 -ffast-math -march=native -S test.c
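As a side note (my suggestion, not from the original post), GCC 4.9 can report which loops were vectorized directly via -fopt-info-vec, which is easier than scanning the assembly; assuming gccgo forwards the same middle-end flag, the equivalent check is:
gcc -O3 -ffast-math -march=native -fopt-info-vec -c test.c
gccgo -O3 -ffast-math -march=native -fopt-info-vec -c test.go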
Why not just build the .so or .a with gcc and call the C function from Go?
This is my first try at OpenMP, but I cannot get any speedup from it. The machine is a Linux amd64 box.
I wrote the following code:
printf ("nt = %d\n", nt);
omp_set_num_threads(nt);
int i, j, s;
#pragma omp parallel for private(j,s)
for (i=0; i<10000; i++)
{
for (j=0; j<100000; j++)
{
s++;
}
}
And compiled it with:
g++ tempomp.cpp -o tomp -lgomp
And ran it with different numbers of threads, but there is no speedup:
nt = 1
elapsed time =2.670000
nt = 2
elapsed time =2.670000
nt = 12
elapsed time =2.670000
Any ideas?
I think you need to add the flag -fopenmp to your compile line:
g++ tempomp.cpp -o tomp -lgomp -fopenmp
When -fopenmp is used, the compiler will generate parallel code
based on the OpenMP directives encountered.
-lgomp links against libgomp, the GNU OpenMP runtime library.
How many cores does your machine have?
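For reference, here is a minimal self-contained variant (my own sketch, not from the original thread) that reports the elapsed time itself and uses a reduction so the work is not optimized away; built with g++ -O2 -fopenmp, it should scale with the number of threads:

#include <stdio.h>
#include <omp.h>

int main(void) {
    double sum = 0.0;
    double t0 = omp_get_wtime();
    /* Without -fopenmp the pragma is ignored and this loop runs serially.
       The reduction combines each thread's partial sum correctly. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= 200000000L; i++)
        sum += 1.0 / (double)i;
    printf("sum = %f, elapsed = %.3f s\n", sum, omp_get_wtime() - t0);
    return 0;
}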