I'm trying, without success, to convince gccgo to vectorize the following snippet:
package foo

func Sum(v []float32) float32 {
    var sum float32 = 0
    for _, x := range v {
        sum += x
    }
    return sum
}
I'm checking the assembly generated by:
$ gccgo -O3 -ffast-math -march=native -S test.go
gccgo version is:
$ gccgo --version
gccgo (Ubuntu 4.9-20140406-0ubuntu1) 4.9.0 20140405 (experimental) [trunk revision 209157]
Isn't gccgo supposed to be able to vectorize this code? The equivalent C code
with the same gcc options is perfectly vectorized with AVX instructions...
UPDATE
Here is the corresponding C example:
#include <stdlib.h>

float sum(float *v, size_t n) {
    size_t i;
    float sum = 0;
    for (i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum;
}
Compile with:
$ gcc -O3 -ffast-math -march=native -S test.c
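A quick way to check the result, assuming the assembly lands in test.s: the AVX packed single-precision add is vaddps, so its presence indicates a vectorized loop (the exact mnemonic depends on -march):
$ grep vaddps test.s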
Why not just build the .so or .a with gcc and call the C function from Go?
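That route works via cgo. A minimal sketch of the build step, assuming the C sum() above is saved as sum.c (the file and library names here are hypothetical, and the Go-side cgo declarations are omitted):
$ gcc -O3 -ffast-math -march=native -shared -fPIC -o libsum.so sum.c
The Go side would then declare sum() in a cgo comment block and link with -lsum; note that every cgo call carries some overhead, which matters for small slices.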
Related
If I compile the following code:
#include <boost/range/irange.hpp>

template<class Integer>
auto iota(Integer last)
{
    return boost::irange(0, last);
}

template<class Integer, class StepSize>
auto iota(Integer last, StepSize step_size)
{
    return boost::irange(0, last, step_size);
}

int main()
{
    int y = 0;
    for (auto x : iota(5))
        y += x;
    return y;
}
With both clang 3.9.0 and GCC 6.3.0 at -O1, GCC optimizes the code away completely (it simply returns the final result), while clang emits lots and lots of code. See this experiment on GodBolt. If I switch clang to -O2, however, it also compiles everything away.
What optimization differences between the two compilers' -O1 modes cause this to happen?
Previously, I asked a question regarding the creation of a static library with PGI and linking it to a program that is built with gcc: c - Linking a PGI OpenACC-enabled library with gcc
Now I have the same question, but for the dynamic case: how can I build a program with gcc when my library is built dynamically with PGI?
Also, consider the following facts:
I want both of them to recognize the same OpenMP pragmas and routines. For instance, when I use an OpenMP critical region in the library, the whole program should be serialized at that section.
OpenACC pragmas are used in the library, which is built with PGI.
The library is loaded completely dynamically in my application; that is, I use dlopen to open the lib and dlsym to find the functions.
I also want my threads to be able to access the GPU simultaneously for data transfers and/or computations. For more details, see the following code snippets.
For instance, building the following lib and main code emits this error: call to cuMemcpyHtoDAsync returned error 1: Invalid value
Note: When building the following code, I intentionally used libgomp (-lgomp) instead of PGI's OpenMP library (-lpgmp) in both cases, lib and main.
Lib code:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
#include <omp.h>

double calculate_sum(int n, double *a) {
    double sum = 0;
    int i;
    #pragma omp critical
    {
        printf("Num devices: %d\n", acc_get_num_devices(acc_device_nvidia));
        #pragma acc enter data copyin(a[0:n])
        #pragma acc parallel loop reduction(+:sum)
        for (i = 0; i < n; i++) {
            sum += a[i];
        }
        #pragma acc exit data delete(a[0:n])
    }
    return sum;
}

int ret_num_dev(int index) {
    int dev = acc_get_num_devices(acc_device_nvidia);
    if (dev > 0) /* any NVIDIA device present? */
        printf("Num devices: %d - Current device: %d\n", dev, acc_get_device());
    return dev;
}
I built the library with the following commands:
pgcc -acc -ta=nvidia:nordc -fPIC -c libmyacc.c
pgcc -shared -Wl,-soname,libctest.so.1 -o libmyacc.so -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc libmyacc.o
Main code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <dlfcn.h>

#define N 1000

// to make sure the library is loaded just once for the whole program
// (note: this flag is read and written by all threads concurrently and
// is not itself synchronized)
static void *lib_handle = NULL;
static int lib_loaded = 0;
static double (*calculate_sum2)(int, double *);

void call_lib_so() {
    // load the library just once and initialize the pointer
    // to the function in the library
    if (lib_loaded == 0) {
        lib_loaded = 1;
        char *error;
        lib_handle = dlopen("/home/millad/temp/gcc-pgi/libmyacc.so", RTLD_NOW);
        if (!lib_handle) {
            fprintf(stderr, "%s\n", dlerror());
            exit(1);
        }
        calculate_sum2 = (double (*)(int, double *)) dlsym(lib_handle, "calculate_sum");
        if ((error = dlerror()) != NULL) {
            fprintf(stderr, "%s\n", error);
            exit(1);
        }
    }
    // execute the function on every call
    int n = N, i;
    double *a = (double *) malloc(sizeof(double) * n);
    for (i = 0; i < n; i++)
        a[i] = 1.0 * i;
    double sum = (*calculate_sum2)(n, a);
    free(a);
    printf("-------- SUM: %.3f\n", sum);
    // dlclose(lib_handle);
}

extern double calculate_sum(int n, double *a);

int main() {
    // allocation and initialization of an array
    double *a = (double *) malloc(sizeof(double) * N);
    int i;
    for (i = 0; i < N; i++) {
        a[i] = (i + 1) * 1.0;
    }
    // access and run the OpenACC region with all threads
    #pragma omp parallel
    call_lib_so();
    return 0;
}
And I built my main code with the following command, using gcc, as described by Mat in my previous question:
gcc f1.c -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
Am I doing something wrong? Are the above steps correct?
Your code works correctly for me. I tried to use what you listed but needed to remove the "libctest.so" soname, change the location where dlopen gets the .so, and add "-DN=1024" to the gcc compilation line. After that, it compiled and ran fine.
% pgcc -acc -ta=nvidia:nordc -fPIC -c libmyacc.c -V16.5
% pgcc -shared -o libmyacc.so -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc libmyacc.o -V16.5
% gcc f1.c -L/proj/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lgomp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc -DN=1024
% ./a.out
Num devices: 8
-------- SUM: 523776.000
Briefly speaking, my question concerns compiling/building files (using libraries) with two different compilers while exploiting OpenACC constructs in the source files.
I have a C source file that has an OpenACC construct. It contains only a simple function that computes the total sum of an array:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

double calculate_sum(int n, double *a) {
    double sum = 0;
    int i;
    printf("Num devices: %d\n", acc_get_num_devices(acc_device_nvidia));
    #pragma acc parallel loop copyin(a[0:n]) reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
I can easily compile it using the following line:
pgcc -acc -ta=nvidia -c libmyacc.c
Then I create a static library with the following line:
ar -cvq libmyacc.a libmyacc.o
To use my library, I wrote the following piece of code:
#include <stdio.h>
#include <stdlib.h>

#define N 1000

extern double calculate_sum(int n, double *a);

int main() {
    printf("Hello --- Start of the main.\n");
    double *a = (double *) malloc(sizeof(double) * N);
    int i;
    for (i = 0; i < N; i++) {
        a[i] = (i + 1) * 1.0;
    }
    double sum = 0.0;
    for (i = 0; i < N; i++) {
        sum += a[i];
    }
    printf("Sum: %.3f\n", sum);
    double sum2 = -1;
    sum2 = calculate_sum(N, a);
    printf("Sum2: %.3f\n", sum2);
    return 0;
}
Now I can use this static library with the PGI compiler itself to compile the above source (f1.c):
pgcc -acc -ta=nvidia f1.c libmyacc.a
And it executes flawlessly. However, things differ with gcc, and this is where my question lies: how can I build it properly with gcc?
Thanks to Jeff's comment on this question (linking pgi compiled library with gcc linker), I can now build my source file (f1.c) without errors, but the executable emits a fatal error.
This is what I use to compile my source file with gcc (f1.c):
gcc f1.c -L/opt/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lpgmp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
This is the error:
Num devices: 2
Accelerator Fatal Error: No CUDA device code available
Thanks to the -v option when compiling f1.c with the PGI compiler, I can see that the compiler invokes many other tools from PGI and NVIDIA (like pgacclnk and nvlink).
My questions:
Am I on the wrong path? Can I call functions in PGI-compiled libraries from GCC and use OpenACC within those functions?
If the answer to the above is yes, can I still link without the extra steps (calling pgacclnk and nvlink) that PGI takes?
If the answer to that is also yes, what should I do?
Add "-ta=tesla:nordc" to your pgcc compilation. By default PGI uses runtime dynamic compilation (RDC) for the GPU code. However RDC requires an extra link step (with nvlink) that gcc does not support. The "nordc" sub-option disables RDC so you'll be able to use OpenACC code in a library. However by disabling RDC you can no longer call external device routines from a compute region.
% pgcc -acc -ta=tesla:nordc -c libmyacc.c
% ar -cvq libmyacc.a libmyacc.o
a - libmyacc.o
% gcc f1.c -L/proj/pgi/linux86-64/16.5/lib -L/usr/lib64 -L/usr/lib/gcc/x86_64-redhat-linux/4.8.5 -L. -laccapi -laccg -laccn -laccg2 -ldl -lcudadevice -lpgmp -lnuma -lpthread -lnspgc -lpgc -lm -lgcc -lc -lgcc -lmyacc
% a.out
Hello --- Start of the main.
Sum: 500500.000
Num devices: 8
Sum2: 500500.000
Hope this helps,
Mat
I am trying to use lapack functions from C.
Here is some test code, copied from this question:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "clapack.h"
#include "cblas.h"

void invertMatrix(float *a, unsigned int height)
{
    int info, ipiv[height];
    info = clapack_sgetrf(CblasColMajor, height, height, a, height, ipiv);
    info = clapack_sgetri(CblasColMajor, height, a, height, ipiv);
}

void displayMatrix(float *a, unsigned int height, unsigned int width)
{
    int i, j;
    for (i = 0; i < height; i++) {
        for (j = 0; j < width; j++) {
            printf("%1.3f ", a[height * j + i]);
        }
        printf("\n");
    }
    printf("\n");
}

int main(int argc, char *argv[])
{
    int i;
    float a[9], b[9], c[9];
    srand(time(NULL));
    for (i = 0; i < 9; i++) {
        a[i] = 1.0f * rand() / RAND_MAX;
        b[i] = a[i];
    }
    displayMatrix(a, 3, 3);
    return 0;
}
I compile this with gcc:
gcc -o test test.c \
-lblas -llapack -lf2c
N.B.: I've tried those libraries in various orders, and I've also tried other libs like -latlas, -lcblas, -lgfortran, etc.
The error message is:
/tmp//cc8JMnRT.o: In function `invertMatrix':
test.c:(.text+0x94): undefined reference to `clapack_sgetrf'
test.c:(.text+0xb4): undefined reference to `clapack_sgetri'
collect2: error: ld returned 1 exit status
clapack.h is found and included (installed as part of atlas). clapack.h declares the offending functions --- so how can they not be found?
The symbols are actually in the library libalapack (found using strings). However, adding -lalapack to the gcc command seems to require also adding -lcblas (lots of undefined cblas_* references), and installing cblas automatically uninstalls atlas, which removes clapack.h.
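(As an aside, nm is a more direct way than strings to confirm where a symbol is defined; for the static library named above this would be, e.g.:
$ nm libalapack.a | grep clapack_sgetrf
where a T in the output marks a defined text symbol.)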
So, this feels like some kind of dependency hell.
I am on FreeBSD 10 amd64, all the relevant libraries seem to be installed and on the right paths.
Any help much appreciated.
Thanks
Ivan
I uninstalled everything remotely relevant --- blas, cblas, lapack, atlas, etc. --- then reinstalled atlas (from ports) alone, and then the lapack and blas packages.
This time around, /usr/local/lib contained a new lib file: libcblas.so --- previous random installations must have deleted it.
The gcc line that now compiles is:
gcc -o test test.c \
-llapack -lblas -lalapack -lcblas
Changing the order of the -l arguments doesn't seem to make any difference.
I am not the best when it comes to compiling/writing makefiles.
I am trying to write a program that uses both GSL and OpenMP.
I have no problem using GSL and OpenMP separately, but I'm having issues using both. For instance, I can compile the GSL program
http://www.gnu.org/software/gsl/manual/html_node/An-Example-Program.html
By typing
$ gcc -c Bessel.c
$ gcc Bessel.o -lgsl -lgslcblas -lm
$ ./a.out
and it works.
I was also able to compile the OpenMP program that I found here:
Starting a thread for each inner loop in OpenMP
In this case I typed
$ gcc -fopenmp test_omp.c
$ ./a.out
And I got what I wanted (all 4 threads I have were used).
However, when I simply write a program that combines the two:
#include <stdio.h>
#include <gsl/gsl_sf_bessel.h>
#include <omp.h>

int main(void)
{
    double x = 5.0;
    double y = gsl_sf_bessel_J0(x);
    printf("J0(%g) = %.18e\n", x, y);

    int dimension = 4;
    int i = 0;
    int j = 0;
    #pragma omp parallel private(i, j)
    for (i = 0; i < dimension; i++)
        for (j = 0; j < dimension; j++)
            printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());

    return 0;
}
Then I try to compile by typing
$ gcc -c Bessel_omp_test.c
$ gcc Bessel_omp_test.o -fopenmp -lgsl -lgslcblas -lm
$ ./a.out
The GSL part works (the Bessel function is computed), but only one thread is used for the OpenMP part. I'm not sure what's wrong here...
You missed the worksharing directive "for" in your OpenMP part. It should be:
// Just in case GSL modifies the number of threads
omp_set_num_threads(omp_get_max_threads());
omp_set_dynamic(0);

#pragma omp parallel for private(i, j)
for (i = 0; i < dimension; i++)
    for (j = 0; j < dimension; j++)
        printf("i=%d, jjj=%d, thread = %d\n", i, j, omp_get_thread_num());
Edit: To summarise the discussion in the comments below, the OP failed to supply -fopenmp during the compilation phase. That prevented GCC from recognising the OpenMP directives, and thus no parallel code was generated.
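For reference, the OP's two-step build from the question would become:
$ gcc -fopenmp -c Bessel_omp_test.c
$ gcc Bessel_omp_test.o -fopenmp -lgsl -lgslcblas -lm
i.e., -fopenmp must be passed when compiling, not only when linking.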
IMHO, it's incorrect to declare the variables i and j as shared. Try declaring them private; otherwise each thread sees the same j, and j++ becomes a race condition among threads.
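A minimal alternative sketch (mine, not from the thread): declaring the loop variables inside the construct makes them private automatically, which sidesteps the question entirely:

#pragma omp parallel for
for (int i = 0; i < dimension; i++)
    for (int j = 0; j < dimension; j++)
        printf("i=%d, j=%d, thread = %d\n", i, j, omp_get_thread_num());

Here i is the worksharing loop variable (implicitly private) and j is declared inside the parallel region, so each thread gets its own copy.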