I often need to sort large numpy arrays (a few billion elements), and this has become a bottleneck in my code. I am looking for a way to parallelize it.
Are there any parallel implementations of the ndarray.sort() function? The numexpr module provides parallel implementations of most math operations on numpy arrays, but it lacks sorting capabilities.
Maybe it is possible to write a simple wrapper around a C++ implementation of parallel sorting and use it through Cython?
I ended up wrapping GCC parallel sort. Here is the code:
parallelSort.pyx
# cython: wraparound = False
# cython: boundscheck = False
import numpy as np
cimport numpy as np
import cython
cimport cython
ctypedef fused real:
    cython.char
    cython.uchar
    cython.short
    cython.ushort
    cython.int
    cython.uint
    cython.long
    cython.ulong
    cython.longlong
    cython.ulonglong
    cython.float
    cython.double

cdef extern from "<parallel/algorithm>" namespace "__gnu_parallel":
    cdef void sort[T](T first, T last) nogil

def numpyParallelSort(real[:] a):
    "In-place parallel sort for numpy types"
    sort(&a[0], &a[a.shape[0]])
Extra compiler args: -fopenmp (compile) and -lgomp (linking)
This makefile will do it:
all:
	cython --cplus parallelSort.pyx
	g++ -g -march=native -Ofast -fpic -c parallelSort.cpp -o parallelSort.o -fopenmp `python-config --includes`
	g++ -g -march=native -Ofast -shared -o parallelSort.so parallelSort.o `python-config --libs` -lgomp

clean:
	rm -f parallelSort.cpp *.o *.so
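If you prefer driving the build through setuptools instead of a hand-written makefile, something along these lines should also work (a sketch, not part of the original answer; passing -fopenmp at link time stands in for the explicit -lgomp):
# setup.py -- illustrative alternative build, not from the original answer
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy

ext = Extension(
    "parallelSort",
    sources=["parallelSort.pyx"],
    language="c++",                         # needed for <parallel/algorithm>
    extra_compile_args=["-fopenmp", "-O3", "-march=native"],
    extra_link_args=["-fopenmp"],           # pulls in the GNU OpenMP runtime
    include_dirs=[numpy.get_include()],     # for 'cimport numpy'
)

setup(ext_modules=cythonize([ext]))
Build it with python setup.py build_ext --inplace.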
And this shows that it works:
from parallelSort import numpyParallelSort
import numpy as np
a = np.random.random(100000000)
numpyParallelSort(a)
print(a[:10])
edit: fixed bug noticed in the comment below
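For a rough check of the speedup, the wrapped sort can be timed against numpy's built-in in-place sort on the same data (an illustrative sketch; absolute numbers depend on the machine and thread count):
import time
import numpy as np
from parallelSort import numpyParallelSort

a = np.random.random(100000000)
b = a.copy()

t0 = time.perf_counter()
b.sort()                    # serial numpy in-place sort, for reference
t1 = time.perf_counter()
numpyParallelSort(a)        # parallel in-place sort
t2 = time.perf_counter()

print("numpy sort    :", t1 - t0, "s")
print("parallel sort :", t2 - t1, "s")
assert np.array_equal(a, b)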
Mergesort parallelizes quite naturally: just have each worker pre-sort an arbitrary chunk, and then run a single merge pass over the results. The final merge should require only O(N) operations, and it's trivial to write a function for doing so in numba or some such.
Wikipedia agrees
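To make that concrete, here is a minimal sketch of the idea in plain numpy/multiprocessing (illustrative only: the chunks are pre-sorted in worker processes and the final k-way merge is done serially with heapq, so inter-process copying makes this far from optimal for billions of elements):
import heapq
import numpy as np
from multiprocessing import Pool

def sort_chunk(chunk):
    return np.sort(chunk)

def parallel_mergesort(a, nworkers=4):
    # each worker pre-sorts one chunk ...
    chunks = np.array_split(a, nworkers)
    with Pool(nworkers) as pool:
        sorted_chunks = pool.map(sort_chunk, chunks)
    # ... then a single O(N) k-way merge pass over the sorted chunks
    return np.fromiter(heapq.merge(*sorted_chunks), dtype=a.dtype, count=a.size)

if __name__ == "__main__":
    a = np.random.random(1000000)
    assert np.array_equal(parallel_mergesort(a), np.sort(a))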
Is it possible to link Cython code which uses OpenMP (e.g. prange statements) against libiomp5 instead of libgomp using gcc? I am aware of several posts, e.g. Telling GCC to *not* link libgomp so it links libiomp5 instead, and others, describing how one might achieve this. However, they do not seem to work for me. What am I doing wrong?
Specifically, say I am using the most recent Anaconda distribution and have some file.pyx on which I run cython -a file.pyx to get file.c. Then for libgomp I would do something like
gcc -shared -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -fopenmp -o file.so -I/include_dirs file.c
Which gives me a file.so that shows
>ldd file.so
...
libgomp.so.1 => /usr/lib64/libgomp.so.1 (0x00007fc3725ab000)
...
For libiomp5, and from reading the previously mentioned posts, I was hoping that this would do the job
gcc -shared -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -o file.so -I/include_dirs file.c -L/lib_dirs -liomp5
Indeed, the file.so I get shows
>ldd *.so
...
libiomp5.so => /lib_dirs/libiomp5.so (0x00007ff92c717000)
...
However, when I link file.so to some code which is forced to use a specific number of OMP threads, only the version of file.so that has been linked against libgomp shows more than a single thread in use. I.e. there seems to be no error from linking against libiomp5, but the system behaves as if no OMP pragmas had been used in the first place.
P.S.: I have also tried adding -Wl,--as-needed to the gcc options (without really knowing what for), but that does not change the picture.
UPDATE: ----------------
Following the request of user vidyalatha-intel, here is an example. It is not coded for optimal style, nor does it solve any particular problem; it is just meant to make the issue reproducible.
A) Some python code to invoke a *.so lib
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import numpy as np
import numpy.random as rnd
import stackovfl_OMP as sc # THE lib
N = 600
# init a couple of 1d and 2d arrays
f = rnd.random(N)
e = rnd.random(N)
v = rnd.random((N,N))+1j*rnd.random((N,N))
z = np.linspace(0,3,150) + .05*1j
numthread = 4 # explicitly force 4 threads
s = []
for i in z:  # for each z do stuff needing OMP in sc.sit
    print(np.real(i))
    s.append([np.real(i), sc.sit(i, v, e, f, numthread)])
B) The cython code stackovfl_OMP.pyx for the lib stackovfl_OMP.so, doing some (rather senseless) stuff, including three loops, the outer one of which uses OMP
# -*- coding: utf-8 -*-
# cython: language_level=3
cimport cython
cimport openmp
from cython.parallel import prange
import numpy as np
cimport numpy as np
#cython.cdivision(True)
#cython.boundscheck(False)
#cython.wraparound(False)
cpdef np.complex128_t sit(
        np.complex128_t z,
        np.ndarray[np.complex128_t,ndim=2] t,
        np.ndarray[np.float64_t,ndim=1] e,
        np.ndarray[np.float64_t,ndim=1] f,
        np.int32_t nt # num threads
        ):
    cdef int a,b,c,l,it
    l = len(e)
    ''' si : Partial storage for each thread in the pool
        siv : Nogil memviews for numpy array r/w access '''
    cdef np.ndarray[np.complex128_t,ndim=1] si = np.zeros(nt,dtype=np.complex128)
    cdef complex[:] siv = si
    for a in prange(l, nogil=True, num_threads=nt): # OMP parallelization of the outer loop
        it = openmp.omp_get_thread_num() # fixed within one thread
        for b in range(l):
            for c in range(l): # do 'something'
                siv[it] = siv[it] + t[c,b]*t[b,a]*t[a,c]/(2*z-e[a]+e[c])*(
                    (f[a]-f[b])/(z-e[a]+e[b]) + (f[c]-f[b])/(z-e[b]+e[c]))
    return np.sum(si) # return collected pool
With A) and B) you can go ahead as described in the original post and generate stackovfl_OMP.so, either linking against libgomp or libiomp5. As stated there, only with libgomp does the machine end up with four threads running when you call python stackovfl.py, while the libiomp5-linked version of stackovfl_OMP.so stays on a single thread. (Additionally exporting OMP_NUM_THREADS=4 into the environment does not change this.)
I brought this question up in the Google cython-users group at https://groups.google.com/g/cython-users/c/2niCShTH4OE, where the answer was finally given by Cython core developer D. Woods.
In a nutshell: split the compile and link steps. In the compile step you can then emit the OMP pragmas. The resulting *.o object file can indeed be linked directly against libiomp5. In the example given here this boils down to, e.g.:
gcc -c -pthread -fPIC -fwrapv -O3 -ffast-math -fno-strict-aliasing -march=native -fopenmp stackovfl_OMP.c -o stackovfl_OMP.o -I/include_paths
gcc -shared -pthread -Wl,--as-needed stackovfl_OMP.o -o stackovfl_OMP.so -L/library_paths -liomp5
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/library_paths
Finally, ldd shows stackovfl_OMP.so to be linked against libiomp5 and it runs as many threads as one likes. (...do I get any performance difference w.r.t. libgomp ... nope.)
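For what it's worth, a quick way to double-check from Python which OpenMP runtime the finished extension actually pulls in (a sketch assuming a Linux box where ldd is available; the module name is the one from the example above):
import subprocess
import stackovfl_OMP

# print the OpenMP-related shared libraries the extension is linked against
out = subprocess.run(["ldd", stackovfl_OMP.__file__],
                     capture_output=True, text=True).stdout
print([line.strip() for line in out.splitlines() if "omp" in line])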
Thx. D. Woods. Credit to those who deserve it.
I am new to OpenMP parallelization and would be thankful if someone could provide advice.
My goal is to enable OpenMP parallelization and reduce the simulation time by increasing the number of threads.
I am using the Intel compiler on Linux and drive the OpenMP build with a makefile.
The makefile is as follows.
# for ifort
ifeq (${FC},ifort)
MKLROOT = /opt/intel/mkl/lib/intel64
FFLAGS += -I${MKLROOT}/include/intel64 -I${MKLROOT}/include
LDFLAGS += -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -qopenmp -liomp5 -lpthread -lm -ldl
endif
When I run the resulting binary, increasing the number of threads has no particular effect on the timing. I was wondering if there is a problem in the makefile.
Can someone please guide me? Thank you very much.
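One way to check this effect is to time the same binary under different OMP_NUM_THREADS settings, for example with a small driver like the sketch below (./simulation stands in for the actual executable built by the makefile):
import os
import subprocess
import time

# run the same executable with an increasing OpenMP thread count and time each run
for n in (1, 2, 4, 8):
    env = dict(os.environ, OMP_NUM_THREADS=str(n))
    t0 = time.perf_counter()
    subprocess.run(["./simulation"], env=env, check=True)
    print(f"{n} threads: {time.perf_counter() - t0:.2f} s")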
I compile a Fortran 90 code with the mpif90 compiler using two different makefiles. The first one looks like this:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none
FOPT = -O3

all: ParP2S.o ParP2S

ParP2S.o: ParP2S.f90
	$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c

ParP2S: ParP2S.o
	$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S

clean:
	rm -f *.o*
The second makefile looks very similar; I just added the -fopenmp flag:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none -fopenmp
FOPT = -O3

all: ParP2S.o ParP2S

ParP2S.o: ParP2S.f90
	$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c

ParP2S: ParP2S.o
	$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S

clean:
	rm -f *.o*
The second makefile is for a hybrid (MPI with OpenMP) version of the code. For now, I have exactly the same code but compiled with these two different makefiles. In the second case, the code is more than 100 times slower. Any comments on what I am doing wrong?
edit 1: I am not running multi-threaded tasks. In fact, the code does not have any OpenMP directives; it is just the pure MPI code, only compiled with a different makefile. Nevertheless, I did try running after the following commands (see below) and it didn't help.
export MV2_ENABLE_AFFINITY=0
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
mpirun -np 2 ./ParP2S
edit 2: I am using gcc version 4.9.2 (I know there was a bug with vectorization with -fopenmp in an older version). I thought the inclusion of the -fopenmp flag could be inhibiting the compiler optimizations; however, after reading the interesting discussion (May compiler optimizations be inhibited by multi-threading?) I am not sure if this is the case. Furthermore, as my code does not have any OpenMP directives, I don't see why the code compiled with -fopenmp should be that much slower.
edit 3: When I run without -fopenmp (first makefile) it takes about 0.2 seconds without optimizations (-O0) and 0.08 seconds with optimizations (-O3), but with the -fopenmp flag it takes about 11 seconds at either -O3 or -O0.
It turned out that the problem was really task affinity, as suggested by Vladimir F and Gilles Gouaillardet (thank you very much!).
First I realized I was running MPI with OpenMPI version 1.6.4 and not MVAPICH2, so the command export MV2_ENABLE_AFFINITY=0 has no real meaning here. Second, I was (presumably) taking care of the affinity of different OpenMP threads by setting
export OMP_PROC_BIND=true
export OMP_PLACES=cores
but I was not setting the correct bindings for the MPI processes, as I was incorrectly launching the application as
mpirun -np 2 ./Par2S
and it seems that, with OpenMPI version 1.6.4, a more appropriate way to do it is
mpirun -np 2 -bind-to-core -bycore -cpus-per-proc 2 ./hParP2S
The options -bind-to-core -bycore -cpus-per-proc 2 ensure 2 cores for my application (see https://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php and also Gilles Gouaillardet's comments on Ensure hybrid MPI / OpenMP runs each OpenMP thread on a different core). Without them, both MPI processes were going to one single core, which was the reason for the poor efficiency of the code when the flag -fopenmp was used in the Makefile.
Apparently, when running pure MPI code compiled without the -fopenmp flag, different MPI processes go automatically to different cores, but with -fopenmp one needs to specify the bindings manually as described above.
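One way to see directly which cores each rank ends up bound to is a tiny mpi4py script like the following (a sketch, assuming Linux and an installed mpi4py; launch it with the same mpirun options as the real application):
# check_affinity.py -- print the CPU set each MPI rank is bound to
import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
print(f"rank {rank}: bound to CPUs {sorted(os.sched_getaffinity(0))}")
For example: mpirun -np 2 -bind-to-core -bycore -cpus-per-proc 2 python check_affinity.py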
As a matter of completeness, I should mention that there is no standard for setting the correct task affinity, so my solution will not work on e.g. MVAPICH2 or (possibly) different versions of OpenMPI. Furthermore, running nproc MPI processes with nthreads each in ncores CPUs would require e.g.
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=nthreads
mpirun -np nproc -bind-to-core -bycore -cpus-per-proc ncores ./hParP2S
where ncores=nproc*nthreads.
P.S.: my code has an MPI_Alltoall call. Having more than one MPI process on a single core (no hyperthreading) while calling this subroutine is most likely the reason why the code was about 100 times slower.
Although there are a few tutorials on the web showing how to compile a C program that uses Haskell functions, every tutorial compiles its C code with ghc. In the "real world", C files are compiled with gcc.
My goal is to create .o files from Haskell code and then link them to the core C program. Below is a basic working example.
Fibonacci.hs:
{-# LANGUAGE ForeignFunctionInterface #-}
module Fibonacci where
import Foreign
import Foreign.C.Types
fibonacci :: Int -> Int
fibonacci n = fibs !! n
  where fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
fibonacci_hs :: CInt -> CInt
fibonacci_hs = fromIntegral . fibonacci . fromIntegral
foreign export ccall fibonacci_hs :: CInt -> CInt
test.c:
#include "Fibonacci_stub.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    int i;
    hs_init(&argc, &argv);
    i = fibonacci_hs(42);
    printf("Fibonacci: %d\n", i);
    hs_exit();
    return 0;
}
Makefile:
all: test

test: Fibonacci.o test.o
	gcc -shared Fibonacci.o test.o -o test

test.o: test.c
	gcc -c -I/usr/lib/ghc/include test.c

Fibonacci.o: Fibonacci.hs
	ghc -c -O Fibonacci.hs

clean:
	rm *.o *.hi test *_stub.h
https://dl.dropboxusercontent.com/u/14826353/fibonacci.tgz
When running make, ghc first generates Fibonacci_stub.h and Fibonacci.o; gcc then compiles test.c into test.o; lastly, gcc should link the .o files and produce the executable.
Instead, this error is produced:
/usr/bin/ld: Fibonacci.o: relocation R_X86_64_PC32 against undefined symbol `base_GHCziBase_plusInt_closure' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
The output says to use -fPIC. I've tried placing it in several places, but these attempts have not eliminated the error.
Where should I place -fPIC in the Makefile?
Is the -fPIC suggestion useful? Are there obvious deficits in the Makefile itself which would prevent compilation?
I faced a similar issue trying to put GHC's .o files into a shared library and realized that GHC simply ignores -fPIC unless the -shared flag is given as well (so your suggestion is correct). For a detailed discussion see this link. The ticket behind the link is marked as "solved", but nothing has actually changed in GHC's command lines since 2009. Briefly speaking, the reason for that behaviour lies in the GHC implementation, and the only proposed solution is to build a separate shared library from the Haskell code.
I have a recursive (but not tail-recursive) inline function for which I'd like gcc to unroll the recursion. Yes, I'm using g++ -O3 -funroll-loops, of course.
inline void recurse_fun(..., unsigned depth = 0, unsigned max_depth = 40) {
    if (++depth > max_depth) return;
    for (auto i = ..., iend = ...; i != iend; i++) {
        if (...) continue;
        ...
        recurse_fun(..., depth, max_depth);
    }
}
I could easily replace this by handling a stack<...> object manually, which gcc should unroll properly, but it would not be quite as elegant or maintainable.
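As an illustration of the rewrite I mean (sketched in Python for brevity; the toy tree and the "visit" action are made up, standing in for the elided parts of the real C++ function):
# recursive version: depth-limited traversal via the call stack
def visit_recursive(tree, out, depth=0, max_depth=40):
    if depth > max_depth:
        return
    out.append(tree["value"])
    for child in tree["children"]:
        visit_recursive(child, out, depth + 1, max_depth)

# manual-stack version: an explicit stack replaces the call stack
def visit_with_stack(tree, out, max_depth=40):
    stack = [(tree, 0)]
    while stack:
        node, depth = stack.pop()
        if depth > max_depth:
            continue
        out.append(node["value"])
        for child in reversed(node["children"]):  # reversed to preserve visit order
            stack.append((child, depth + 1))

tree = {"value": 1, "children": [{"value": 2, "children": []},
                                 {"value": 3, "children": []}]}
a, b = [], []
visit_recursive(tree, a)
visit_with_stack(tree, b)
assert a == b == [1, 2, 3]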
I should really try profiling both versions regardless, but I'm curious if anyone can say with confidence that some recent gcc version would or would not handle this correctly.
GCC (at least recent versions like 4.5 or 4.6) does unroll some tail recursive calls.
Of course you need to ask it to optimize (so -O2 or -O3 is required).
To understand what it is doing, you can:
Ask for the assembly output with something like gcc -O3 -fverbose-asm -S yoursource.c
Ask for various dump files, like gcc -c -fdump-tree-all -fdump-ipa-all -O3 yoursource.c (and there are other dump files)
Beware that GCC will print a lot (hundreds!) of dump files. And the dump files are only meant to help GCC developers or GCC plugin developers (or GCC MELT developers). Don't expect them to stay in the same format from one release of GCC to the next.
The numbering of the dump files is useless: it is not chronological or logical.
And the dump options are likely to change in the next GCC release (4.7, probably in 2012).