Linking cython code against libiomp5 instead of libgomp - openmp

Is it possible to link cython code which uses OMP (say, things like "prange" statements) against libiomp5 instead of libgomp using gcc? I am aware of several posts, e.g., Telling GCC to *not* link libgomp so it links libiomp5 instead, and others, describing how one might achieve this. However, they do not seem to work for me. What am I doing wrong?
Specifically, say I am using a recent Anaconda distribution and have some file.pyx on which I run cython -a file.pyx to get file.c. Then for libgomp I would do something like
gcc -shared -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -fopenmp -o file.so -I/include_dirs file.c
This gives me a file.so that shows
>ldd file.so
...
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007fc3725ab000)
...
For libiomp5, and from reading the previously mentioned posts, I was hoping that this would do the job
gcc -shared -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -o file.so -I/include_dirs file.c -L/lib_dirs -liomp5
Indeed, the file.so I get shows
>ldd *.so
...
libiomp5.so => /lib_dirs/libiomp5.so (0x00007ff92c717000)
...
However, when I link file.so to some code which is forced to use a specific number of OMP threads, only the version of file.so linked against libgomp uses more than a single thread. I.e., linking to libiomp5 produces no error, but the system behaves as if no OMP pragmas had been used in the first place.
PS: I have also tried adding -Wl,--as-needed to the gcc options (for no particular reason), but that does not change the picture.
UPDATE: ----------------
Following the request of user vidyalatha-intel, here is an example. It is not coded for optimal style, nor does it solve any particular problem; it is just meant to make the issue reproducible.
A) Some python code to invoke a *.so lib
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import numpy.random as rnd
import stackovfl_OMP as sc # THE lib
N = 600
# init a couple of 1d and 2d arrays
f = rnd.random(N)
e = rnd.random(N)
v = rnd.random((N,N))+1j*rnd.random((N,N))
z = np.linspace(0,3,150) + .05*1j
numthread = 4 # explicitly force 4 threads
s = []
for i in z: # for each z do stuff needing OMP in sc.sit
    print(np.real(i))
    s.append([np.real(i),sc.sit(i,v,e,f,numthread)])
B) The cython code stackovfl_OMP.pyx for the lib stackovfl_OMP.so, doing some (rather senseless) stuff, including three loops, the outer one of which uses OMP
# -*- coding: utf-8 -*-
# cython: language_level=3
cimport cython
cimport openmp
from cython.parallel import prange
import numpy as np
cimport numpy as np
#cython.cdivision(True)
#cython.boundscheck(False)
#cython.wraparound(False)
cpdef np.complex128_t sit(
    np.complex128_t z,
    np.ndarray[np.complex128_t,ndim=2] t,
    np.ndarray[np.float64_t,ndim=1] e,
    np.ndarray[np.float64_t,ndim=1] f,
    np.int32_t nt # num threads
    ):
    cdef int a,b,c,l,it
    l = len(e)
    ''' si  : Partial storage for each thread in the pool
        siv : Nogil memviews for numpy array r/w access '''
    cdef np.ndarray[np.complex128_t,ndim=1] si = np.zeros(nt,dtype=np.complex128)
    cdef complex[:] siv = si
    for a in prange(l, nogil=True, num_threads=nt): # OMP parallelization outer loop
        it = openmp.omp_get_thread_num() # fixed within one thread
        for b in range(l):
            for c in range(l): # Do 'something'
                siv[it] = siv[it] + t[c,b]*t[b,a]*t[a,c]/(2*z-e[a]+e[c])*(
                    (f[a]-f[b])/(z-e[a]+e[b]) + (f[c]-f[b])/(z-e[b]+e[c]))
    return np.sum(si) # return collected pool
With A) and B) you can proceed as described in the original post and generate stackovfl_OMP.so, linked either against libgomp or against libiomp5. As stated there, only the libgomp build ends up with four threads running when you call python stackovfl.py, while the libiomp5-linked version of stackovfl_OMP.so stays single-threaded. (Additionally exporting OMP_NUM_THREADS=4 into the environment does not change this.)
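To double-check which OpenMP runtime actually ends up in the process, here is a minimal diagnostic sketch, assuming Linux and its /proc filesystem:
# Sketch: after importing the extension, list the OpenMP runtimes mapped
# into this process -- expect to see libgomp.so.1 or libiomp5.so.
import stackovfl_OMP  # the extension built as described above
with open("/proc/self/maps") as maps:
    print({line.split()[-1] for line in maps if "omp" in line.lower()})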

I also raised this question in the Google cython-users group at https://groups.google.com/g/cython-users/c/2niCShTH4OE where the answer was finally given by Cython core developer D. Woods.
In a nutshell: split the compile and link steps. In the compile step you can then enable the OMP pragmas via -fopenmp. The resulting *.o object file can indeed be linked directly against libiomp5. For the example given above this boils down to, e.g.:
gcc -c -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -fopenmp stackovfl_OMP.c -o stackovfl_OMP.o -I/include_paths
gcc -shared -pthread -Wl,--as-needed stackovfl_OMP.o -o stackovfl_OMP.so -L/library_paths -liomp5
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/library_paths
Finally, ldd shows stackovfl_OMP.so to be linked against libiomp5 and it runs as many threads as one likes. (...do I get any performance difference w.r.t. libgomp ... nope.)
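If you build through setuptools rather than invoking gcc by hand, the same split can be expressed by putting -fopenmp only in the compile flags and -liomp5 (without -fopenmp) in the link flags. A minimal setup.py sketch, assuming the placeholder paths from the example above:
# setup.py -- sketch only: enable OMP pragmas at compile time, but link
# against libiomp5 so gcc never pulls in libgomp at link time.
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "stackovfl_OMP",
    sources=["stackovfl_OMP.pyx"],
    include_dirs=[np.get_include()],
    extra_compile_args=["-fopenmp", "-O3", "-ffast-math", "-march=native"],
    extra_link_args=["-L/library_paths", "-liomp5"],  # note: no -fopenmp here
)
setup(ext_modules=cythonize(ext))
Built with python setup.py build_ext --inplace, ldd should again report libiomp5 rather than libgomp.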
Thanks, D. Woods. Credit where credit is due.

Related

Error trying to install QCL (Quantum Computation Language) on Mac 10.11

I am trying to install QCL-0.6.4 from this source, but I keep getting errors when I run the make command in the terminal.
I came across this thread about installing QCL on OSX, but when trying to adjust the Makefile I always run into these errors:
extern.cc:84:18: error: variable length array of non-POD element type 'tComplex'
(aka 'complex<double>')
tComplex u[dim][dim];
^
extern.cc:193:9: error: variable length array of non-POD element type 'term'
term t[dim];
^
extern.cc:224:9: error: variable length array of non-POD element type 'term'
term t[dim];
Any help on this would be highly appreciated.
There are a few issues at play here which you need to overcome to get this compiling on OSX. My instructions below assume that you are running El Capitan (10.11.1 in my instance), but you may get some mileage out of them for different versions.
Firstly, Xcode currently uses Apple's LLVM Compiler as the default C++ compiler. However, this doesn't support some of GCC's extensions, such as support for non-POD variable length arrays.
To get around this, I installed and compiled with GCC: if you haven't already, install Homebrew, and then install the latest GCC compiler with:
$ brew install gcc
At the time of writing, this will install GCC v5.2.0.
That should fix your initial problem, but you will instantly hit others!
The next issue is that the included libqc.a will need recompiling for x86_64. So you will need to modify the file <base_dir>/qc/Makefile with the following changes:
...
# Add:
CXX = /usr/local/Cellar/gcc/5.2.0/bin/g++-5
CXXFLAGS = $(ARCHOPT) -c -pedantic -Wall $(DEBUG) $(PRGOPT)
...
Then rebuild libqc.a:
$ cd qc; make clean; make
If all goes well, you should have a shiny new libqc.a.
Finally, modify the main Makefile <base_dir>/Makefile with the following changes:
...
# Comment out:
#PLOPT = -DQCL_PLOT
#PLLIB = -L/usr/X11/lib -lplotter
...
# Comment out:
#RLOPT = -DQCL_USE_READLINE
#RLLIB = -lreadline
#RLLIB = -lreadline -lncurses
...
# Comment out:
#CXX = g++
#CPP = $(CC) -E
#CXXFLAGS = -c $(ARCHOPT) $(DEBUG) $(PLOPT) $(RLOPT) $(IRQOPT) $(ENCOPT) -I$(QCDIR) -DDEF_INCLUDE_PATH="\"$(QCLDIR)\""
#LDFLAGS = $(ARCHOPT) -L$(QCDIR) $(DEBUG) $(PLLIB) -lm -lfl -lqc $(RLLIB)
# Add:
CXX = /usr/local/Cellar/gcc/5.2.0/bin/g++-5
CPP = $(CC) -E
CXXFLAGS = -c $(ARCHOPT) $(DEBUG) $(PLOPT) $(RLOPT) $(IRQOPT) $(ENCOPT) -I$(QCDIR) -DDEF_INCLUDE_PATH="\"$(QCLDIR)\""
LDFLAGS = $(ARCHOPT) -L$(QCDIR) $(DEBUG) $(PLLIB) -lm -ll -lqc $(RLLIB) -lc++
...
This should now allow you to build the main application as per the instructions:
$ make clean; make; make install

How can a segfault happen at runtime only because of linking unused modules?

I get a segmentation fault from a memory allocation statement just because I have linked some unrelated procedures to the binary.
I have a very simple Fortran program:
program whatsoever
  !USE payload_modules
  double precision,allocatable:: Vmat(:,:,:)
  allocate(Vmat(2,2,2))
  Vmat=1
  write(*,*) Vmat
  deallocate (Vmat)
  ! some more lines of code using procedures from payload_module
end program whatsoever
Compiling this using gfortran whatsoever.f95 -o whatsoever leads to a program with the expected behaviour. Of course, this program is not meant to print 1.000 eight times but to call the payload modules, still hidden in the comments. However, if I compile and link the program with the modules issuing
gfortran -c -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none payload_module1.f90 payload_module2.f90 whatsever.f95
gcc -g -nostdlib -v -Wl,--verbose -std=gnu99 -shared -Wl,-Bsymbolic-functions \
-Wl,-z,relro -o whatsoever whatsoever.o payload_module1.o payload_module2.o
the program whatsoever doesn't run any more. I get a segmentation fault at the allocate statement. I have not yet uncommented the lines related to the modules (however, uncommenting them leads to the same behaviour)!
I know that the payload modules' code is not buggy because I ran it before from R and wrapped this working code into a f90 module. There are no name collisions; nothing in the modules is called Vmat. There is only one other call to allocate in the modules, and it never caused any trouble. There is still plenty of memory left. gdb didn't give me any hints except a memory address.
How can linking routines that are actually not called crash a program?
Compiling your code with
gfortran whatsoever.f95 -o whatsoever
works because you link against the system libraries; everything is in place. This would correspond to
gfortran whatsoever.f95 payload_module1.f90 payload_module2.f90 -o whatsoever
which would also work. The commands you used instead omit the system libraries, and the code fails the first time you call a function from them (the allocation). You don't see that you are missing the libraries, because you create a shared object (which is typically linked against the libraries later on).
You chose to separate compiling the objects and linking them into an executable. Doing this for a Fortran program using gcc, you need to specify the Fortran runtime libraries, so there's a -lgfortran missing.
I'm not sure about that particular choice of compile options... -shared is usually used for libraries; are you sure you want a shared binary (whatever that is)?
With -nostdlib you tell the compiler not to link against the system libraries. You would then need to specify those libraries (which you don't).
For the main program test.F90 and a module payload.F90, I run
gfortran -c -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none payload.F90 test.F90
gcc -g -v -Wl,--verbose -std=gnu99 -Wl,-Bsymbolic-functions \
-Wl,-z,relro -lgfortran -o whatsoever test.o payload.o
This compiles and executes correctly.
It might be easier to pass the advanced options directly to gfortran:
gfortran -g -fPIC -ffpe-trap=overflow -pedantic -fbounds-check \
-fimplicit-none -Wl,-Bsymbolic-functions -Wl,-z,relro \
payload.F90 test.F90 -o whatsoever
The result is the same.

Parallel in-place sort for numpy arrays

I often need to sort large numpy arrays (a few billion elements), which has become a bottleneck of my code. I am looking for a way to parallelize it.
Are there any parallel implementations for the ndarray.sort() function? Numexpr module provides parallel implementation for most math operations on numpy arrays, but lacks sorting capabilities.
Maybe, it is possible to make a simple wrapper around a C++ implementation of parallel sorting, and use it through Cython?
I ended up wrapping GCC parallel sort. Here is the code:
parallelSort.pyx
# cython: wraparound = False
# cython: boundscheck = False
import numpy as np
cimport numpy as np
import cython
cimport cython
ctypedef fused real:
    cython.char
    cython.uchar
    cython.short
    cython.ushort
    cython.int
    cython.uint
    cython.long
    cython.ulong
    cython.longlong
    cython.ulonglong
    cython.float
    cython.double

cdef extern from "<parallel/algorithm>" namespace "__gnu_parallel":
    cdef void sort[T](T first, T last) nogil

def numpyParallelSort(real[:] a):
    "In-place parallel sort for numpy types"
    sort(&a[0], &a[a.shape[0]])
Extra compiler args: -fopenmp (compile) and -lgomp (linking)
This makefile will do it:
all:
	cython --cplus parallelSort.pyx
	g++ -g -march=native -Ofast -fpic -c parallelSort.cpp -o parallelSort.o -fopenmp `python-config --includes`
	g++ -g -march=native -Ofast -shared -o parallelSort.so parallelSort.o `python-config --libs` -lgomp

clean:
	rm -f parallelSort.cpp *.o *.so
And this shows that it works:
from parallelSort import numpyParallelSort
import numpy as np
a = np.random.random(100000000)
numpyParallelSort(a)
print(a[:10])
edit: fixed bug noticed in the comment below
Mergesort parallelizes quite naturally. Just have each worker pre-sort an arbitrary chunk, and then run a single merge pass on it. The final merge should require only O(N) operations, and it's trivial to write a function for doing so in numba or somesuch; see the sketch below.
Wikipedia agrees
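For illustration, a minimal sketch of that idea in plain Python/numpy (my assumptions: np.sort releases the GIL for native dtypes, so threads suffice, and heapq.merge stands in for the merge pass; this is not a replacement for the Cython wrapper above):
# Sketch: pre-sort chunks in worker threads, then merge the sorted runs once.
import heapq
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_mergesort(a, workers=4):
    chunks = np.array_split(a, workers)          # arbitrary chunking
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(np.sort, chunks))   # parallel pre-sort
    # single merge pass over the pre-sorted runs
    return np.fromiter(heapq.merge(*runs), dtype=a.dtype, count=len(a))

a = np.random.random(1_000_000)
assert np.array_equal(parallel_mergesort(a), np.sort(a))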

how to compile lapack so that it can be used correctly during installation of octave?

I'm trying to install the latest octave 3.8.1 from source on a cluster running redhat+IBM LSF. I don't have write access to anywhere except my own home dir, which is why I have to install octave from source. The blas and lapack provided by the cluster do not work, so I have to build them myself. I have now finished compiling both blas and lapack and passed ./configure, but when I run make, an error is reported.
These are steps I used to build my own BLAS and LAPACK. The source of BLAS is in ~/src/BLAS while the source of LAPACK is in ~/src/lapack-3.5.0 and the source of octave 3.8.1 is in ~/src/octave-3.8.1.
With only two modules loaded, 1) pcre/8.33 and 2) acml/5.3.1/gfortran64, I compiled the BLAS shared library using
gfortran -shared -O2 *.f -o libblas.so -fPIC
and static library using
gfortran -O2 -c *.f -fPIC
ar cr libblas.a *.o
Then I copied the shared library libblas.so to ~/src/octave-3.8.1. The contents of the make.inc file in lapack's dir are:
####################################################################
# LAPACK make include file. #
# LAPACK, Version 3.5.0 #
# November 2013 #
####################################################################
#
SHELL = /bin/sh
#
# Modify the FORTRAN and OPTS definitions to refer to the
# compiler and desired compiler options for your machine. NOOPT
# refers to the compiler options desired when NO OPTIMIZATION is
# selected. Define LOADER and LOADOPTS to refer to the loader and
# desired load options for your machine.
#
FORTRAN = gfortran
OPTS = -shared -O2 -fPIC
DRVOPTS = $(OPTS)
NOOPT = -O0 -frecursive
LOADER = gfortran
LOADOPTS =
#
# Timer for the SECOND and DSECND routines
#
# Default : SECOND and DSECND will use a call to the EXTERNAL FUNCTION ETIME
#TIMER = EXT_ETIME
# For RS6K : SECOND and DSECND will use a call to the EXTERNAL FUNCTION ETIME_
# TIMER = EXT_ETIME_
# For gfortran compiler: SECOND and DSECND will use a call to the INTERNAL FUNCTION ETIME
TIMER = INT_ETIME
# If your Fortran compiler does not provide etime (like Nag Fortran Compiler, etc...)
# SECOND and DSECND will use a call to the INTERNAL FUNCTION CPU_TIME
# TIMER = INT_CPU_TIME
# If neither of this works...you can use the NONE value... In that case, SECOND and DSECND will always return 0
# TIMER = NONE
#
# Configuration LAPACKE: Native C interface to LAPACK
# To generate LAPACKE library: type 'make lapackelib'
# Configuration file: turned off (default)
# Complex types: C99 (default)
# Name pattern: mixed case (default)
# (64-bit) Data model: LP64 (default)
#
# CC is the C compiler, normally invoked with options CFLAGS.
#
CC = gcc
CFLAGS = -O3
#
# The archiver and the flag(s) to use when building archive (library)
# If you system has no ranlib, set RANLIB = echo.
#
ARCH = ar
ARCHFLAGS= cr
RANLIB = ranlib
#
# Location of the extended-precision BLAS (XBLAS) Fortran library
# used for building and testing extended-precision routines. The
# relevant routines will be compiled and XBLAS will be linked only if
# USEXBLAS is defined.
#
# USEXBLAS = Yes
XBLASLIB =
# XBLASLIB = -lxblas
#
# The location of the libraries to which you will link. (The
# machine-specific, optimized BLAS library should be used whenever
# possible.)
#
#BLASLIB = ../../librefblas.a
BLASLIB = ~/src/BLAS/libblas.a
LAPACKLIB = liblapack.a
TMGLIB = libtmglib.a
LAPACKELIB = liblapacke.a
Then I typed make to compile LAPACK. After compilation, I copied the output liblapack.a to ~/src/octave-3.8.1.
The ./configure command line is:
./configure --prefix=$HOME/bin/octave --with-blas=./libblas.so --with-lapack=$HOME/src/octave-3.8.1/liblapack.a --disable-readline --enable-64
I can pass ./configure. Then I type make to try to build octave 3.8.1, and I get the above error.
As the make.inc file shows, I have followed the compiler's suggestion to "recompile with -fPIC" and modified make.inc accordingly. I also added the -shared switch to the OPTS variable. In addition, I have tried an older LAPACK version, but that did not work either. I really have no idea why the error still comes up, so I wonder if you could tell me how to compile the LAPACK library so that it can be used correctly during the installation of octave 3.8.1. The following two points may be worth considering: (1) should I compile lapack as a static library or a shared library? (2) should the -fPIC switch be applied to lapack's compilation or to octave's make? If the latter, how do I apply -fPIC to make? You need not restrict yourself to these two points, since there may be other reasons for the error. Any advice on solving this problem is welcome. If you need any other information, please tell me. Thank you.
I just compiled the lapack shared lib on my boss's beast... Here's a link which almost did it right.
I made some changes:
(1) Add -fPIC to OPTS and NOOPT in make.inc
(2) Change the names in make.inc to .so
BLASLIB = ../../libblas.so
LAPACKLIB = ../liblapack.so
(3) In ./SRC, change the Makefile from
../$(LAPACKLIB): $(ALLOBJ)
	$(ARCH) $(ARCHFLAGS) $@ $(ALLOBJ)
	$(RANLIB) $@
to
../$(LAPACKLIB): $(ALLOBJ)
	$(LOADER) $(LOADOPTS) -shared -Wl,-soname,liblapack.so -o $@ $(ALLOBJ) ../libblas.so
Because lapack calls blas, if you miss that very last part, your liblapack.so will fail! You need to LINK liblapack.so against libblas.so (libatlas.so is also OK). You can use "ldd liblapack.so" to check its dependencies: if you see libblas.so in there, you pretty much did it right.
(4) In ./BLAS/SRC, change the Makefile from
$(BLASLIB): $(ALLOBJ)
	$(ARCH) $(ARCHFLAGS) $@ $(ALLOBJ)
	$(RANLIB) $@
to
$(BLASLIB): $(ALLOBJ)
	$(LOADER) $(LOADOPTS) -z muldefs -shared -Wl,-soname,libblas.so -o $@ $(ALLOBJ)
(5) I don't need libtmg.so, so I didn't change it...
Run
make blaslib
Then
make lapacklib
You will have both of them compiled. I checked liblapack.so by building numpy against it and by loading it with Python's ctypes.CDLL. Everything worked for me when solving for eigenvalues and eigenvectors... so it should be fine...
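That ctypes check can look like the following minimal sketch (the library path is a placeholder, and dgesv_ is just one convenient LAPACK symbol to probe):
# Sketch: verify the freshly built liblapack.so loads and exposes a LAPACK symbol.
import ctypes
lapack = ctypes.CDLL("/path-to-lib/liblapack.so")  # placeholder path
print(lapack.dgesv_)  # raises AttributeError if the symbol is missing
If loading fails because libblas.so cannot be found, that is exactly the LD_LIBRARY_PATH issue from point (6) below.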
(6) YOU MAY NEED TO SET UP LD_LIBRARY_PATH to where you keep your library files.
google it... If not set by admin, then
export LD_LIBRARY_PATH=path-to-lib
If already set, then
export LD_LIBRARY_PATH=path-to-lib:$LD_LIBRARY_PATH
so that your libs take precedence over the defaults.
So that you won't have ld linking errors. Good luck!!
In lapack-3.7.0, there are redundant lines in the SRC/Makefile. Simply deleting them will solve your error.
I would suggest using OpenBLAS.
> git clone https://github.com/xianyi/OpenBLAS.git
> make
> make PREFIX=INSTALL_DIR install
move the libraries from OpenBLAS to /usr/lib64
> cp /path/to/OpenBLAS/lib/* /usr/lib64/
then go to the octave installation path and run
> "your specific flags" ./configure "your specific arguments" --with-blas="-lopenblas"

Does the sequence of the args matter when using gcc?

gcc -o fig fig.c -I./include ./lib/libmylib.a -g
gcc -g fig.c -o fig -I./include ./lib/libmylib.a
gcc -g -o fig fig.c -I./include ./lib/libmylib.a
It seems that gcc accepts several different orderings.
However, what would be an unacceptable ordering? Does the sequence of arguments matter?
One ordering that does matter is where you put libraries when you link statically.
Basically, if you choose to statically link libraries in, the libraries should be specified after your code, as GCC scans the code first for external library dependencies and then checks the libraries to bring them in. If you specified the libraries before the code that needs them, GCC would scan them, determine no symbols were needed yet, and you'd end up with linker errors.
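For example, with the files from the question, the difference would show up like this (a hypothetical illustration, assuming fig.c calls a function defined in libmylib.a):
gcc -g ./lib/libmylib.a fig.c -o fig   # archive scanned before fig.c needs it: undefined references
gcc -g fig.c ./lib/libmylib.a -o fig   # archive comes after the code that needs it: links fine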
