I've implemented the code below, which generates vectors of random numbers using the MKL VSL library:
! ifort -mkl test1.f90 -cpp -openmp
include "mkl_vsl.f90"
#define ITERATION 1000000
#define LENGH 10000
program test
use mkl_vsl_type
use mkl_vsl
use mkl_service
use omp_lib
implicit none
integer i,brng, method, seed, dm,n,errcode
real(kind=8) r(LENGH) , s
real(kind=8) a, b, start,endd
TYPE (VSL_STREAM_STATE) :: stream
integer(4) :: nt
! *****
brng = VSL_BRNG_SOBOL
method = VSL_RNG_METHOD_UNIFORM_STD
seed = 777
a = 0.0
b = 1.0
s = 0.0
!call omp_set_num_threads(4)
call omp_set_dynamic(0)
nt = omp_get_max_threads()
! *****
print *,'max OMP threads number',nt
if (1 == omp_get_dynamic()) then
print '(" Intel OMP may use less than "I0" threads for a large problem")', nt
else
print '(" Intel OMP should use "I0" threads for a large problem")', nt
end if
if (1 == omp_get_max_threads()) print *, "Intel MKL does not employ threading"
!call mkl_set_num_threads(4)
call mkl_set_dynamic(0)
nt = mkl_get_max_threads()
print *,'max MKL threads number',nt
if (1 == mkl_get_dynamic()) then
print '(" Intel MKL may use less than "I0" threads for a large problem")', nt
else
print '(" Intel MKL should use "I0" threads for a large problem")', nt
end if
if (1 == mkl_get_max_threads()) print *, "Intel MKL does not employ threading"
! ***** Initialize *****
errcode=vslnewstream( stream, brng, seed )
! ***** Call RNG *****
start=omp_get_wtime()
do i=1,ITERATION
errcode=vdrnguniform( method, stream, LENGH, r, a, b )
s = s + sum(r)/LENGH
end do
endd=omp_get_wtime()
! ***** DEleting the stream *****
errcode=vsldeletestream(stream)
! *****
print *, s/ITERATION, endd-start
end program test
I don't see any speedup when using 4 or 32 threads, for instance.
I use the Intel compiler version 13.1.3 and compile with
ifort -mkl test1.f90 -cpp -openmp
It's like the random numbers are not generated in parallel.
Any hints here?
Thank you,
Éric.
Your code doesn't contain any OpenMP directives to actually parallelise the work, so when it executes it runs only one thread. It is not sufficient to use omp_lib and to scatter a few calls to functions such as omp_get_wtime around; you actually have to insert some worksharing directives.
If I run your code, as is, my performance monitor shows that only one thread is active, and your code reports
max OMP threads number 16
Intel OMP should use 16 threads for a large problem
max MKL threads number 16
Intel MKL should use 16 threads for a large problem
0.499972674509302 11.2807227574035
If I simply wrap the loop in an OpenMP worksharing directive, like this
!$omp parallel do
do i=1,ITERATION
errcode=vdrnguniform( method, stream, LENGH, r, a, b )
s = s + sum(r)/LENGH
end do
!$omp end parallel do
then the performance monitor on my dual quad-core PC with hyperthreading shows that 16 threads are active, and your program reports
max OMP threads number 16
Intel OMP should use 16 threads for a large problem
max MKL threads number 16
Intel MKL should use 16 threads for a large problem
0.380979220384302 7.17352125150956
I guess the hint I would offer is: study your favourite OpenMP tutorial, in particular the sections covering the parallel and do directives. I offer no warranty that the simple modification I have made does not break your program; in particular I don't guarantee that I haven't introduced a race condition.
I leave you the exercise of determining whether the speed-up on going from 1 to 16 (hyper-)threads is acceptable and any analysis of why it appears to be so modest.
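For what it's worth, the obvious races in that quick modification are the shared accumulator s, the shared buffer r, and the single VSL stream used by every thread; the fact that the reported mean drops from roughly 0.5 to 0.38 suggests they really do bite. A rough sketch of a race-free arrangement, written here against the MKL VSL C interface rather than your Fortran one, and swapping SOBOL for the MT2203 family of generators (which is designed to be split across threads, so the numbers will not match yours exactly), could look like this:
// Sketch only: one VSL stream and one buffer per thread, plus a reduction on s.
// Build with something like: icpc -qopenmp -mkl sketch.cpp (adjust to your toolchain).
#include <cstdio>
#include <vector>
#include <omp.h>
#include <mkl_vsl.h>

int main()
{
    const int    n     = 10000;     // LENGH in the original program
    const int    iters = 1000000;   // ITERATION in the original program
    const double a = 0.0, b = 1.0;
    double s = 0.0;

    #pragma omp parallel reduction(+:s)
    {
        // One independent MT2203 generator per thread.
        VSLStreamStatePtr stream;
        vslNewStream(&stream, VSL_BRNG_MT2203 + omp_get_thread_num(), 777);

        std::vector<double> r(n);   // private buffer, one per thread

        #pragma omp for schedule(static)
        for (int i = 0; i < iters; i++) {
            vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, n, r.data(), a, b);
            double partial = 0.0;
            for (int j = 0; j < n; j++) partial += r[j];
            s += partial / n;       // safe: s is a reduction variable
        }

        vslDeleteStream(&stream);
    }

    std::printf("%f\n", s / iters);
    return 0;
}
The same pattern carries over to the Fortran VSL interface: the essential points are one stream and one buffer per thread, and a reduction on s.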
Here are the files for the test.
# CMakeLists.txt
cmake_minimum_required(VERSION 3.16)
project(CALLBACK_TEST)
set(CMAKE_CXX_STANDARD 17)
add_compile_options(-O3 -fopenmp -fPIC)
add_link_options(-fopenmp)
add_subdirectory(pybind11)
pybind11_add_module(callback callback.cpp)
add_custom_command(TARGET callback POST_BUILD
COMMAND ${CMAKE_COMMAND} -E create_symlink $<TARGET_FILE:callback> ${CMAKE_CURRENT_SOURCE_DIR}/callback.so
)
// callback.cpp
#include <cmath>
#include <functional>
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
namespace py = pybind11;
class C
{
public:
C(std::function<float(float)> f, size_t s) : f_(f), v_(s, 1) {}
void apply()
{
#pragma omp parallel for
for (size_t i = 0; i < v_.size(); i++)
v_[i] = f_(v_[i]);
}
void apply_direct()
{
#pragma omp parallel for
for (size_t i = 0; i < v_.size(); i++)
v_[i] = log(1 + v_[i]);
}
private:
std::vector<float> v_;
std::function<float(float)> f_;
};
PYBIND11_MODULE(callback, m)
{
py::class_<C>(m, "C")
.def(py::init<std::function<float(float)>, size_t>())
.def("apply", &C::apply, py::call_guard<py::gil_scoped_release>())
.def("apply_direct", &C::apply_direct);
m.def("log1p", [](float x) -> float
{ return log(1 + x); });
}
# callback.py
import math
import time
from callback import C, log1p
def run(n, func):
start = time.time()
if func:
for _ in range(n):
c = C(func, 1000)
c.apply()
else:
for _ in range(n):
c = C(func, 1000)
c.apply_direct()
end = time.time()
print(end - start)
if __name__ == "__main__":
n = 1000
one = 1
print("Python")
run(n, lambda x: math.log(x + 1))
print("C++")
run(n, log1p)
print("Direct")
run(n, None)
I run the Python script on a server with 48 CPU cores. Here are the running times. They show that 1) the running time increases as OMP_NUM_THREADS increases, especially when accepting the Python/C++ callback from Python, and 2) keeping everything inside C++ is much faster, which seems to contradict the "no overhead" claim in the documentation.
$ python callback.py
Python
19.612852573394775
C++
19.268250226974487
Direct
0.04382634162902832
$ OMP_NUM_THREADS=4 python callback.py
Python
6.042902708053589
C++
5.48648738861084
Direct
0.03322458267211914
$ OMP_NUM_THREADS=1 python callback.py
Python
0.5964927673339844
C++
0.38849639892578125
Direct
0.020793914794921875
And when OpenMP is turned off:
$ python callback.py
Python
0.8492450714111328
C++
0.26660943031311035
Direct
0.010872125625610352
So what is going wrong here?
There are several issues in your code.
First of all, the OpenMP parallel region has a significant overhead here, since the work has to be shared between 48 threads. This work-sharing can be quite expensive on some platforms, depending on the scheduling policy; you need to use schedule(static) to minimize this overhead. In the worst case, a runtime could create 48 threads and join them every time, which is expensive: creating and joining 48*1000 threads would take at least several seconds, and the higher the number of threads, the slower the program. That being said, most runtimes try to keep an active pool of threads; still, this is not always possible (and it is an optimization, not something required by the specification). Note that most OpenMP runtimes detect the case where OMP_NUM_THREADS is set to 1 so as to have a very low overhead in that case. The general rule of thumb is to avoid using multithreading for very short operations, such as ones taking less than 1 ms.
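To illustrate the scheduling point on its own, here is a minimal standalone sketch (not your class, just the same shape of work) of the apply loop with an explicit static schedule, so the 1000 iterations are split into one contiguous chunk per thread, decided once:
// Minimal sketch: log(1 + x) over 1000 floats with a static schedule.
// Compile with e.g. g++ -O3 -fopenmp example.cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> v(1000, 1.0f);

    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < v.size(); i++)
        v[i] = std::log(1.0f + v[i]);

    std::printf("%f\n", v[0]);   // keep the result observable
    return 0;
}
Even with a static schedule, a loop this short pays the fork/join and barrier cost on every call, which is the overhead described above.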
Moreover, the parallel for loop is subject to false sharing. The vector of 1000 float items takes 4000 bytes in memory, which is spread over about 63 cache lines of 64 bytes on mainstream platforms. With 48 threads, almost all of those cache lines have to move between cores, which is expensive compared to the computation being done. When two threads working on adjacent cache lines execute in an interleaved fashion, a cache line can bounce many times for just a few iterations. On NUMA architectures this is even more expensive, since cache lines have to move between NUMA nodes. Doing all of this 1000 times is very expensive.
Additionally, AFAIK, calling a Python function from a parallel context is either not safe, or gives no speed-up because of the global interpreter lock (GIL). By not safe, I mean that the CPython interpreter's data structures can be corrupted, causing non-deterministic crashes; this is why the GIL exists. The GIL prevents any code from scaling across multiple threads as long as it is not released. Releasing the GIL for too short a period also causes cache-line bouncing, which is detrimental to performance (more so than just using sequential code).
Finally, the "C++" and "Python" cases have a much bigger overhead than the "Direct" method because they call dynamically-defined functions that cannot be inlined or vectorized by the compiler. Python functions are especially slow because of the CPython interpreter. If you want to make a fair benchmark, you need to compare the pybind11 solution with a pure-C++ one that also uses std::function (but be careful about clever compiler optimizations).
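To give an idea of what such a fair pure-C++ comparison could look like (illustrative names only, deliberately single-threaded, and still subject to the clever-compiler caveat), time the same loop once through a std::function, which cannot be inlined, and once with the call written out directly:
// Sketch of a fair comparison: indirect call through std::function
// versus a direct, inlinable call.
#include <chrono>
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

int main()
{
    std::vector<float> v(1000, 1.0f);
    std::function<float(float)> f = [](float x) { return std::log(1.0f + x); };

    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 1000; ++rep)
        for (float &x : v) x = f(x);                // indirect call, not inlinable
    auto t1 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 1000; ++rep)
        for (float &x : v) x = std::log(1.0f + x);  // direct call, inlinable/vectorisable
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("std::function: %lld us, direct: %lld us, v[0] = %f\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<us>(t2 - t1).count(),
                v[0]);               // printing v[0] keeps both loops observable
    return 0;
}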
I want to distribute subroutines to different tasks with OpenMP.
In my code I implemented this:
!$omp parallel
!$omp single
do thread = 1, omp_get_num_threads()
!$omp task
write(*,*) "Task,", thread, "is computing"
call find_pairs(me, thread, points)
call count_neighbors(me, thread, neighbors(:, thread))
!$omp end task
end do
!$omp end single
!$omp end parallel
The subroutines find_pairs and count_neighbors do some calculations.
I set the number of threads in my program before with:
nr_threads = 4
call omp_set_num_threads(nr_threads)
Compiling this with GNU Fortran (Ubuntu 8.3.0-6ubuntu1) 8.3.0 and running it gives me only one thread, running at nearly 100% when monitored with top. Nevertheless, it prints the expected output:
Task, 1 is computing
Task, 2 is computing
Task, 3 is computing
Task, 4 is computing
I compile it using:
gfortran -fopenmp main.f90 -o program
What I want is to distribute different calls of the subroutines according to
the number of OpenMP threads, working in parallel.
From what I understand, a single thread is created, and that thread creates the different tasks.
I am preparing for an exam and am doing some exercises without an answer key. I have been given this code and am wondering whether I have correctly turned it into SIMD form.
The code
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000; i++)
C += A[i] * B[i];
Since there is no remainder, we don't need to take care of it. We also assume a 128-bit register, which can therefore operate on 4 single-precision floating-point values at once.
My result - using SIMD
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000/4; i += 4)
C += A[i] * B[i];
C += A[i+1] * B[i+1];
C += A[i+2] * B[i+2];
C += A[i+3] * B[i+3];
What advantages can you see for using SIMD instructions instead of writing programs with multiple threads?
Assuming the omitted curly braces on your second loop are simply a typo, as is the malformed for-loop header, and setting aside the fact that you ask about multiplying floats while your code shows arrays of ints, this won't vectorise well even if the compiler sees it. While the compiler might do the loads of 4 values from A and B as a single instruction each, and do the 4 multiplies in one instruction, your code forces it to then extract each of the 4 products and sum them sequentially, and getting individual values out of a SIMD register is typically quite slow.
If on the other hand you did this
float A[100000];
float B[100000];
float C0=0, C1=0, C2=0, C3=0;
for (size_t i=0; i < 100000; i += 4)
{
C0 += A[i+0] * B[i+0];
C1 += A[i+1] * B[i+1];
C2 += A[i+2] * B[i+2];
C3 += A[i+3] * B[i+3];
}
float C = (C0 + C1) + (C2 + C3);
Then a good compiler could vectorise this, because now it sees that each iteration loads two SIMD registers, multiplies them, and adds the result to a SIMD register of running sums; the 4 partial sums are only extracted and combined at the very end.
A vectorising compiler can do this with SIMD without changing the order of evaluation of the individual sums (FP maths is NOT associative). The compiler is typically not allowed to change the order of FP maths for exactly this reason (not without extra flags that allow it to technically breach the language standard), so the code above can be represented precisely by SIMD instructions and will run much faster (in fact I'd unroll the loop a further stage, as the multiplication will be the bottleneck as it stands).
This is really the trick with SIMD: you have to understand how the operation would best be implemented with vector instructions, write your code to execute that same sequence of operations, and hope the compiler spots what you've done.
Or you can write the vector instructions yourself with intrinsics, or use OpenMP or similar to tell the compiler more explicitly what to do.
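For instance, here is a sketch of the OpenMP route (compile with -fopenmp or -fopenmp-simd): the reduction clause explicitly grants the compiler permission to keep partial sums in a vector register and combine them at the end, which is the same reassociation the four-accumulator version above spells out by hand.
// Sketch: expressing the dot product with OpenMP's simd construct.
#include <cstdio>

int main()
{
    static float A[100000], B[100000];
    for (int i = 0; i < 100000; i++) { A[i] = 1.0f; B[i] = 2.0f; }

    float C = 0.0f;
    #pragma omp simd reduction(+:C)
    for (int i = 0; i < 100000; i++)
        C += A[i] * B[i];

    std::printf("%f\n", C);   // 200000 for this fill pattern
    return 0;
}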
Amongst the advantages of SIMD over threads for such an operation is the fact that you're making use of more of the silicon within a single core, so you're not preventing another thread from getting cycles. On our compute grid we typically run many single-threaded processes on any one machine to keep all the cores busy at all times; in such a case, doing this sum using more cores is a false economy, since you'd simply be stealing cycles in which another thread could usefully be running another job.
Yes, the provided code should compile into SIMD instructions with capable CPUs and compilers.
On vector-capable processors, SIMD exposes hardware features that greatly accelerate identical, parallel computations. For instance, SIMD typically makes better use of the cache on a single core due to streaming RAM access, assuming the data being processed is localized in contiguous areas of memory. With multithreading, cache contention and other synchronization overhead can actually reduce performance as the various cores attempt to write data simultaneously. This is in addition to the intrinsic boost on von Neumann machines from only having to read one, not four, separate instructions from the shared system memory.
The logic to do these arithmetic operations in parallel is always present, but it requires specific SIMD instructions to utilize. As a result, SIMD tends to be used in hot loops, where hand-tuning makes sense as part of overall optimization.
I am testing FFTW in a Fortran program, because I need to use it. Since I am working with huge matrices, my first solution is to use OpenMP. When my matrix has dimension 500 x 500 x 500, the following error happens:
Operating system error: Cannot allocate memory
Allocation would exceed memory limit
Program aborted. Backtrace:
I compiled the code using the following: gfortran -o test teste_fftw_openmp.f90 -I/usr/local/include -L/usr/lib/x86_64-linux-gnu -lfftw3_omp -lfftw3 -lm -fopenmp
PROGRAM test_fftw
USE omp_lib
USE, intrinsic:: iso_c_binding
IMPLICIT NONE
INCLUDE 'fftw3.f'
INTEGER::i, DD=500
DOUBLE COMPLEX:: OUTPUT_FFTW(3,3,3)
DOUBLE COMPLEX, ALLOCATABLE:: A3D(:,:,:), FINAL_OUTPUT(:,:,:)
integer*8:: plan
integer::iret, nthreads
INTEGER:: indiceX, indiceY, indiceZ, window=2
!! TESTING 3D FFTW with OPENMP
ALLOCATE(A3D(DD,DD,DD))
ALLOCATE(FINAL_OUTPUT(DD-2,DD-2,DD-2))
write(*,*) '---------------'
write(*,*) '------------TEST 3D FFTW WITH OPENMP----------'
A3D = reshape((/(i, i=1,DD*DD*DD)/),shape(A3D))
CALL dfftw_init_threads(iret)
CALL dfftw_plan_with_nthreads(nthreads)
CALL dfftw_plan_dft_3d(plan, 3,3,3, OUTPUT_FFTW, OUTPUT_FFTW, FFTW_FORWARD, FFTW_ESTIMATE)
FINAL_OUTPUT=0.
!$OMP PARALLEL DO DEFAULT(SHARED) SHARED(A3D,plan,window) &
!$OMP PRIVATE(indiceX, indiceY, indiceZ, OUTPUT_FFTW, FINAL_OUTPUT)
DO indiceZ=1,10!500-window
write(*,*) 'INDICE Z=', indiceZ
DO indiceY=1,10!500-window
DO indiceX=1,10!500-window
CALL dfftw_execute_dft(plan, A3D(indiceX:indiceX+window,indiceY:indiceY+window, indiceZ:indiceZ+window), OUTPUT_FFTW)
FINAL_OUTPUT(indiceX,indiceY,indiceZ)=SUM(ABS(OUTPUT_FFTW))
ENDDO
ENDDO
ENDDO
!$OMP END PARALLEL DO
call dfftw_destroy_plan(plan)
CALL dfftw_cleanup_threads()
DEALLOCATE(A3D,FINAL_OUTPUT)
END PROGRAM test_fftw
Notice that this error occurs when I just use a huge matrix (A3D) without running the loop over all the values of this matrix (to run over all values, I would set the limits of the three nested loops to 500-window).
I tried to solve this (following tips here and here) with -mcmodel=medium in the compilation, without success.
I had success when I compiled with gfortran -o test teste_fftw_openmp.f90 -I/usr/local/include -L/usr/lib/x86_64-linux-gnu -lfftw3_omp -lfftw3 -lm -fopenmp -fmax-stack-var-size=65536
So, I don't understand:
1) Why is there a memory allocation problem if the huge matrix is a shared variable?
2) Will the solution I found still work if I have more huge matrix variables, for example 3 more 500 x 500 x 500 matrices to store calculation results?
3) In the tips I found, people said that using allocatable arrays/matrices would solve this, but I was already using them and it made no difference. Is there anything else I need to do?
Two double complex arrays with 500 x 500 x 500 elements require 4 gigabytes of memory (each array is 500³ elements × 16 bytes per double complex element, i.e. about 2 GB). It is likely that the amount of memory available on your computer is not sufficient.
If you only work with small windows, you might consider not keeping the whole array in memory at the same time, but only parts of it. Or distribute the computation across multiple computers using MPI.
Or just use a computer with bigger RAM.
I am new to OpenMP and I just tried to write a small program with the parallel for construct. I have trouble understanding the output of my program. I don't understand why thread number 3 prints the output before 1 and 2. Could someone offer me an explanation?
So, the program is:
#pragma omp parallel for
for (i = 0; i < 7; i++) {
printf("We are in thread number %d and are printing %d\n",
omp_get_thread_num(), i);
}
and the output is:
We are in thread number 0 and are printing 0
We are in thread number 0 and are printing 1
We are in thread number 3 and are printing 6
We are in thread number 1 and are printing 2
We are in thread number 1 and are printing 3
We are in thread number 2 and are printing 4
We are in thread number 2 and are printing 5
My processor is a Intel(R) Core(TM) i5-2410M CPU with 4 cores.
Thank you!
OpenMP makes no guarantees of the relative ordering, in time, of the execution of statements by different threads. OpenMP leaves it to the programmer to impose such ordering if it is required. In general it is not required, in many cases not even desirable, which is why OpenMP's default behaviour is as it is. The cost, in time, of imposing such an ordering is likely to be significant.
I suggest you run much larger tests several times; you should observe that the cross-thread sequencing of events is, essentially, random.
If you want to print in order then you can use the ordered construct
#pragma omp parallel for ordered
for (i = 0; i < 7; i++) {
#pragma omp ordered
printf("We are in thread number %d and are printing %d\n",
omp_get_thread_num(), i);
}
I assume this requires threads executing later iterations to wait for the ones executing earlier iterations, so it will have an effect on performance. You can see it used here: http://bisqwit.iki.fi/story/howto/openmp/#ExampleCalculatingTheMandelbrotFractalInParallel
That draws the Mandelbrot set as characters, using ordered. A much faster solution than using ordered is to fill an array of the characters in parallel and then draw them serially (try the code). Since one uses OpenMP for performance, I have never found a good reason to use ordered, but I'm sure it has its uses somewhere.
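A rough sketch of that faster alternative, reusing the output format from the question: each iteration just records which thread handled it, and a plain serial loop afterwards does the printing in order.
// Sketch: fill a buffer in parallel, then print serially.
#include <cstdio>
#include <omp.h>

int main()
{
    const int n = 7;
    int owner[n];

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        owner[i] = omp_get_thread_num();   // no ordering constraint while filling

    for (int i = 0; i < n; i++)            // serial loop, so output comes out in order
        std::printf("We are in thread number %d and are printing %d\n", owner[i], i);

    return 0;
}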