Running time scales with the number of threads when running a function received from Python inside OpenMP parallel block - openmp

Here are the files for the test.
# CMakeLists.txt
cmake_minimum_required(VERSION 3.16)
project(CALLBACK_TEST)
set(CMAKE_CXX_STANDARD 17)
add_compile_options(-O3 -fopenmp -fPIC)
add_link_options(-fopenmp)
add_subdirectory(pybind11)
pybind11_add_module(callback callback.cpp)
add_custom_command(TARGET callback POST_BUILD
COMMAND ${CMAKE_COMMAND} -E create_symlink $<TARGET_FILE:callback> ${CMAKE_CURRENT_SOURCE_DIR}/callback.so
)
// callback.cpp
#include <cmath>
#include <functional>
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
namespace py = pybind11;
class C
{
public:
    C(std::function<float(float)> f, size_t s) : f_(f), v_(s, 1) {}
    void apply()
    {
#pragma omp parallel for
        for (size_t i = 0; i < v_.size(); i++)
            v_[i] = f_(v_[i]);
    }
    void apply_direct()
    {
#pragma omp parallel for
        for (size_t i = 0; i < v_.size(); i++)
            v_[i] = log(1 + v_[i]);
    }
private:
    std::vector<float> v_;
    std::function<float(float)> f_;
};
PYBIND11_MODULE(callback, m)
{
    py::class_<C>(m, "C")
        .def(py::init<std::function<float(float)>, size_t>())
        .def("apply", &C::apply, py::call_guard<py::gil_scoped_release>())
        .def("apply_direct", &C::apply_direct);
    m.def("log1p", [](float x) -> float
          { return log(1 + x); });
}
# callback.py
import math
import time
from callback import C, log1p
def run(n, func):
    start = time.time()
    if func:
        for _ in range(n):
            c = C(func, 1000)
            c.apply()
    else:
        for _ in range(n):
            c = C(func, 1000)
            c.apply_direct()
    end = time.time()
    print(end - start)
if __name__ == "__main__":
    n = 1000
    one = 1
    print("Python")
    run(n, lambda x: math.log(x + 1))
    print("C++")
    run(n, log1p)
    print("Direct")
    run(n, None)
I ran the Python script on a server with 48 CPU cores. Here are the running times. They show that (1) the running time increases as OMP_NUM_THREADS increases, especially when a Python or C++ callback is passed in from Python, and (2) keeping everything inside C++ is much faster, which seems to contradict the "no overhead" claim in the documentation.
$ python callback.py
Python
19.612852573394775
C++
19.268250226974487
Direct
0.04382634162902832
$ OMP_NUM_THREADS=4 python callback.py
Python
6.042902708053589
C++
5.48648738861084
Direct
0.03322458267211914
$ OMP_NUM_THREADS=1 python callback.py
Python
0.5964927673339844
C++
0.38849639892578125
Direct
0.020793914794921875
And when OpenMP is turned off:
$ python callback.py
Python
0.8492450714111328
C++
0.26660943031311035
Direct
0.010872125625610352
So what goes wrong here?

There are several issues in your code.
First of all, the OpenMP parallel region has a significant overhead here, since the work has to be shared between 48 threads. Depending on the scheduling policy, this work-sharing can be quite expensive on some platforms; use schedule(static) to minimize that overhead. In the worst case, a runtime could create 48 threads and join them on every call. Creating and joining 48*1000 threads would be very expensive (it should take at least several seconds), and the more threads there are, the slower the program gets. That being said, most runtimes try to keep a pool of threads alive. Still, this is not always possible (it is an optimization, not something the specification requires). Note that most OpenMP runtimes detect the case where OMP_NUM_THREADS is set to 1 and have a very low overhead in that case. The general rule of thumb is to avoid multithreading for very short operations, such as ones taking less than 1 ms.
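The two suggestions above (a static schedule, and no threading for very short loops) can be sketched as follows. This is a minimal illustration, not the asker's class: the function name apply_log1p and the threshold kMinParallelSize are assumptions made up for the example, and the threshold value has to be measured on the target machine.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical threshold: loops shorter than this run on a single thread.
// The value is a placeholder and must be tuned by measurement.
constexpr long kMinParallelSize = 100000;

// Writes log(1 + in[i]) to out[i].
// schedule(static) gives every thread one fixed, contiguous chunk, and the
// `if` clause disables the parallel region for small inputs.
void apply_log1p(const std::vector<float>& in, std::vector<float>& out)
{
    assert(in.size() == out.size());
    const long n = static_cast<long>(in.size());
    #pragma omp parallel for schedule(static) if(n >= kMinParallelSize)
    for (long i = 0; i < n; i++)
        out[i] = std::log(1.0f + in[i]);
}
```

With the 1000-element vector from the question, the `if` clause evaluates to false and the loop runs serially, which is the behaviour the rule of thumb asks for.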
Moreover, the parallel for loop is subject to false sharing. The vector of 1000 float items takes 4000 bytes in memory, so it is spread over 63 cache lines of 64 bytes on mainstream platforms. With 48 threads, almost all cache lines have to move between cores, which is expensive compared to the computation done. When two threads working on adjacent cache lines have an interleaved execution, a cache line can bounce many times for just a few iterations. On NUMA architectures, this is even more expensive since cache lines have to move between NUMA nodes. Doing this 1000 times is very expensive.
Additionally, AFAIK, calling a Python function from a parallel context is either not safe or gets no speed-up because of the global interpreter lock (GIL). By not safe, I mean that the CPython interpreter's data structures can be corrupted, causing non-deterministic crashes. This is why the GIL exists. The GIL prevents any code from scaling across multiple threads as long as it is not released. Releasing the GIL for too short a period also causes cache line bouncing, which is detrimental to performance (it can end up slower than sequential code).
Finally, the "C++" and "Python" variants have a much bigger overhead than the "Direct" one because they call dynamically defined functions that the compiler cannot inline or vectorize. Python functions are especially slow because of the CPython interpreter. For a fair benchmark, compare the pybind11 solution with one that uses std::function (be careful about clever compiler optimizations, though).
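A pure C++ baseline for that comparison could look roughly like the sketch below. The names apply_callback, apply_direct and seconds are hypothetical helpers introduced for this example; they mirror the two loops in the question without any pybind11 or OpenMP involved.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <functional>
#include <vector>

// Same loop body as C::apply(): the callee is only known at run time, so the
// compiler generally cannot inline or vectorise the call.
void apply_callback(std::vector<float>& v, const std::function<float(float)>& f)
{
    assert(f);  // calling an empty std::function throws std::bad_function_call
    for (float& x : v)
        x = f(x);
}

// Same loop body as C::apply_direct(): the computation is visible to the optimiser.
void apply_direct(std::vector<float>& v)
{
    for (float& x : v)
        x = std::log(1.0f + x);
}

// Returns the wall-clock time of one call to `work`, in seconds.
template <typename Work>
double seconds(Work&& work)
{
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

If the std::function target is visible in the same translation unit, an optimising compiler may still see through the indirection, so defining the callback in a separate source file is one way to keep the comparison honest.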

Related

Is there a way to efficiently synchronize a subset of threads using OpenMP?

Because OpenMP nested parallelism often has performance problems, I was wondering if there is a way to implement a partial barrier that would synchronize only a subset of all threads.
Here is an example code structure:
#pragma omp parallel
{
int nth = omp_get_num_threads();
int tid = omp_get_thread_num();
if (tid < nth/2) {
// do some work
...
// I need some synchronization here, but only for nth/2 threads
#pragma omp partial_barrier(nth/2)
// do some more work
...
} else {
// do some other independent work
...
}
}
I did not find anything like that in the OpenMP standard, but maybe there is a way to efficiently program similar behaviour with locks or something?
EDIT:
So my actual problem is that I have a computation kernel (a Legendre transform -- part A) that is efficiently parallelized using OpenMP with 4 to 12 threads depending on the problem size.
This computation is followed by a Fast Fourier transform (part B).
I have several independent datasets (typically about 6 to 10) that have to be processed by A followed by B.
I would like to use more parallelism with more threads (48 to 128, depending on the machines).
Since A is not efficiently parallelized with more than 4 to 12 threads, the idea is to split the threads into several groups, each group working on an independent dataset. Because the datasets are independent, I don't need to synchronize all the threads (which is quite expensive when many threads are used) before doing B, only the subset working on a given dataset.
OpenMP tasks with dependencies would do what I need, but my experience is that on some platforms (Xeon servers) the performance is significantly lower than what you can get with simple threads.
Is there a way to synchronize a subset of threads efficiently?

Turn the code into a code using SIMD instructions

I am preparing for an exam and am doing some exercises that come without an answer key. I have been given this code and am wondering whether I have correctly turned it into SIMD instructions.
The code
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000; i++)
C += A[i] * B[i];
Since there is no remainder, we don't need to take care of it. We also assume a 128-bit register, which can therefore hold 4 single-precision floating-point values.
My result - using SIMD
int A[100000];
int B[100000];
int C=0;
for int(i=0; i < 100000/4; i += 4)
C += A[i] * B[i];
C += A[i+1] * B[i+1];
C += A[i+2] * B[i+2];
C += A[i+3] * B[i+3];
What advantages can you see for using SIMD instructions instead of writing programs with multiple threads?
Assuming the omitted curly braces on your second loop and the malformed for statement are simply typos, and setting aside the fact that you ask about multiplying floats but your code shows arrays of ints, this won't vectorise well even if the compiler sees it. While the compiler might do the loads of 4 values from A and B as a single instruction each, and do the 4 multiplies in one instruction, your code forces the compiler to then extract each of the 4 products and sum them sequentially, and getting individual values out of a SIMD register is typically quite slow.
If on the other hand you did this
float A[100000];
float B[100000];
float C0=0, C1=0, C2=0, C3=0;
// i advances by 4 per iteration, so the bound is the full array length
for (size_t i=0; i < 100000; i += 4)
{
    C0 += A[i+0] * B[i+0];
    C1 += A[i+1] * B[i+1];
    C2 += A[i+2] * B[i+2];
    C3 += A[i+3] * B[i+3];
}
float C = (C0 + C1) + (C2 + C3);
Then a good compiler could vectorise this as now it sees that within each loop it loads two SIMD registers, multiplies them, then it can add the result to a SIMD register of the sums, and only extracts those 4 sums and sums them all at the end.
A vectorising compiler can do this with SIMD without changing the order of evaluation of the individual sums (FP maths is NOT associative). The compiler is typically not allowed to change the order of FP maths for this reason (not without extra flags that let it technically breach the language standard), so the code above can be represented exactly by SIMD instructions and will run much faster (in fact I'd unroll the loop a further stage, as the multiplication will be a bottleneck as it stands).
This is sort of the trick with SIMD, you have to understand and then think how the operation would be best implemented with vector instructions, and then write your code to execute the same sequence of operations, and hope the compiler spots what you've done.
Or you can write the vector instructions yourself with intrinsics, or use OpenMP or similar to tell the compiler more explicitly what to do.
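As a rough sketch of the OpenMP route mentioned above: the simd construct with a reduction clause explicitly allows the compiler to keep one partial sum per SIMD lane and combine them at the end. The function name dot is an assumption for this example. With GCC or Clang the pragma only takes effect when compiling with -fopenmp or -fopenmp-simd; otherwise it is ignored and the loop stays a plain scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of two float vectors.
// reduction(+:sum) permits the per-lane partial sums that a strict
// left-to-right floating-point sum would otherwise forbid.
float dot(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Because the order of the additions may change, the result can differ in the last bits from the scalar loop, which is the same trade-off as the four-accumulator version.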
Amongst the advantages of SIMD over threads for such an operation is the fact that you're making use of more of the silicon within a single core... so you're not preventing another thread from getting cycles. On our compute grid we typically run many single threaded processes on any one machine to keep all the cores busy at all times... in such a case doing this sum using more cores is a false economy, you'd simply be stealing cycles that another thread could usefully be running another job.
Yes, the provided code should compile into SIMD instructions with capable CPUs and compilers.
On vector-capable processors, SIMD exposes hardware features that greatly accelerate identical, parallel computations. For instance, SIMD typically makes better use of the cache on a single core due to streaming RAM access, assuming the data being processed is localized in contiguous areas of memory. Using multiprocessing, cache competition and other synchronization overhead could actually reduce performance as the various cores attempt to write data simultaneously. This is in addition to the intrinsic boost on von-Neumann machines from only having to read one, not four, separate instructions from the shared system memory.
The logic to do these arithmetic operations in parallel is always present, but requires specific SIMD instructions to utilize. As a result, SIMD tends to be used in hot loops where hand tuning makes overall optimization sense.

OpenMP slower reduction

Here are two versions of an OpenMP code, one with reduction and one without.
// with reduction
#pragma omp parallel for reduction(+:sum)
for (i=1;i<= num_steps; i++){
x = (i-0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
// without reduction
#pragma omp parallel private(i)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
double x;
double partial_sum = 0;
for (i=id;i< num_steps; i+=numthreads){
x = (i+0.5)*step;
partial_sum += + 4.0/(1.0+x*x);
}
#pragma omp critical
sum += partial_sum;
}
I ran both versions on 8 cores, and the total time doubles for the reduction version. What's the reason? Thanks.
Scalar reduction in OpenMP is usually quite fast. The observed behaviour in your case is due to two things done wrong in two different ways.
In your first code you did not make x private. Therefore it is shared among the threads and besides getting incorrect results, the execution suffers from the data sharing. Whenever one thread writes to x, the core that it executes on sends a message to all other cores and makes them invalidate their copies of that cache line. When any of them writes to x later, the whole cache line has to be reloaded and then the cache lines in all other cores get invalidated. And so forth. This slows things down significantly.
In your second code you have used the OpenMP critical construct. It is relatively heavyweight compared with the atomic adds usually used to implement the reduction at the end. Atomic adds on x86 are performed using the LOCK instruction prefix, and everything is implemented in hardware. Critical sections, on the other hand, are implemented using mutexes and require several instructions and often busy-waiting loops. This is far less efficient than atomic adds.
In the end, your first code is slowed down by bad data sharing, and your second code is slowed down by the use of an inappropriate synchronisation primitive. It just happens that on your particular system the latter effect is less severe than the former, and hence your second example runs faster.
If you want to manually parallelize the loop as well as the reduction you can do it like this:
#pragma omp parallel private(i)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
int start = id*num_steps/numthreads;
int finish = (id+1)*num_steps/numthreads;
double x;
double partial_sum = 0;
for (i=start; i<finish ; i++){
x = (i+0.5)*step;
partial_sum += 4.0/(1.0+x*x);
}
#pragma omp atomic
sum += partial_sum;
}
However, I don't recommend this. Reductions don't have to be done with atomic and you should just let OpenMP parallelize the loop. The first case is the best solution (but make sure you declare x private).
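A minimal sketch of the recommended form, with x made private simply by declaring it inside the loop body. The wrapper function compute_pi and the definition step = 1/num_steps are assumptions; the question does not show them, but the integrand 4/(1+x*x) over [0, 1] integrates to pi.

```cpp
#include <cassert>

// Midpoint-rule approximation of the integral of 4/(1+x^2) over [0, 1].
double compute_pi(long num_steps)
{
    assert(num_steps > 0);
    const double step = 1.0 / static_cast<double>(num_steps);  // assumed definition
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        // Declared inside the loop, so every thread has its own x.
        const double x = (static_cast<double>(i) + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Adding private(x) to the pragma would have the same effect for a variable declared outside the loop.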
Edit: According to Hristo, once you make x private these two methods are nearly the same in speed. I want to explain why using critical in your second method, instead of atomic or letting OpenMP do the reduction, has hardly any effect on the performance in this case.
There are two ways I can think of doing a reduction:
Sum the partial sums linearly using atomic or critical
Sum the partial sums using a tree. I.e. if you have 8 cores this gives you eight partial sums you reduce this to 4 partial sums then 2 partial sums then 1.
The first case scales linearly with the number of cores. The second case scales as the log of the number of cores. So one may be tempted to think the second case is always better. However, for only eight cores the total time is entirely dominated by computing the partial sums. The difference between adding eight numbers with atomic/critical and reducing the tree in 3 steps will be negligible.
What if you have e.g. 1024 cores? Then the tree can be reduced in only 10 steps and the linear sum takes 1024 steps. But the constant term can be much larger for the second case and doing the partial sum of a large array e.g. with 1 million elements probably still dominates the reduction.
So I suspect that using atomic or even critical for a reduction has a negligible effect on the reduction time in general.
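To make the step counts concrete, here is a small serial sketch of the tree-shaped combination of partial sums. The function name tree_reduce is made up for this example, and the code only illustrates the shape of the reduction; it is not how any particular OpenMP runtime implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combines the partial sums pairwise, level by level.
// The total ends up in partial[0]; the return value is the number of levels.
int tree_reduce(std::vector<double>& partial)
{
    assert(!partial.empty());
    int levels = 0;
    for (std::size_t stride = 1; stride < partial.size(); stride *= 2) {
        // The additions within one level are independent of each other,
        // so a parallel implementation could perform them simultaneously.
        for (std::size_t i = 0; i + stride < partial.size(); i += 2 * stride)
            partial[i] += partial[i + stride];
        ++levels;
    }
    return levels;
}
```

For 8 partial sums this takes 3 levels and for 1024 it takes 10, compared with one addition per partial sum for the linear approach.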

Why is this simple OpenCL kernel running so slowly?

I'm looking into OpenCL, and I'm a little confused why this kernel is running so slowly, compared to how I would expect it to run. Here's the kernel:
__kernel void copy(
const __global char* pSrc,
__global __write_only char* pDst,
int length)
{
const int tid = get_global_id(0);
if(tid < length) {
pDst[tid] = pSrc[tid];
}
}
I've created the buffers in the following way:
char* out = new char[2048*2048];
cl::Buffer(
context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
length,
out);
Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(1),
NULL,
&event);
event.wait();
On average, the time is around 75 milliseconds, as calculated by:
cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";
I'm running Windows 7 with an Intel i5-3450 chip (Ivy Bridge architecture). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think event.getProfilingInfo includes the communication time between the host and device. Thoughts?
EDIT:
At the suggestion of ananthonline, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 millis. Still not as fast as I would have hoped, but an improvement. Thanks ananthonline!
I think your main problem is the 2048*2048 work groups you are using. The OpenCL drivers on your system have to manage a lot more overhead if you have this many single-item work groups. This would be especially bad if you were to execute this program on a GPU, because you would get a very low level of saturation of the hardware.
Optimization: call your kernel with larger work groups. You don't even have to change your existing kernel. See the question "What should this size be?". I have used 64 below as an example; 64 happens to be a decent number on most hardware.
std::size_t myOptimalGroupSize = 64;
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(myOptimalGroupSize),
NULL,
&event);
event.wait();
You should also get your kernel to do more than copy a single value. I have given an answer to a similar question about global memory over here.
CPUs are very different from GPUs. Running this on an x86 CPU, the best way to achieve decent performance would be to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).
In my limited experience with OpenCL on CPUs, I have never reached the performance levels that I could get with an OpenMP parallelization.
The best way to do a copy in parallel with a CPU would be to divide the block to copy into a small number of large sub-block, and let each thread copy a sub-block.
The GPU approach is orthogonal: each thread participates in the copy of the same block.
This is because on GPUs, different threads can access contiguous memory regions efficiently (coalescing).
To do an efficient copy on CPU with OpenCL, use a loop inside your kernel to copy contiguous data. And then use a workgroup size not larger than the number of available cores.
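For comparison, the CPU-style decomposition described above (a small number of large contiguous sub-blocks, one per thread) can be sketched with OpenMP in C++. The function name chunked_copy and the num_chunks parameter are assumptions for this illustration; in practice num_chunks would be on the order of the number of cores.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `length` bytes from src to dst in `num_chunks` contiguous sub-blocks.
// Each loop iteration (and therefore each thread) copies one whole sub-block.
void chunked_copy(const char* src, char* dst, std::size_t length, int num_chunks)
{
    assert(num_chunks > 0);
    const std::size_t chunks = static_cast<std::size_t>(num_chunks);
    const std::size_t chunk = (length + chunks - 1) / chunks;  // rounded up
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < num_chunks; c++) {
        const std::size_t begin = std::min(length, static_cast<std::size_t>(c) * chunk);
        const std::size_t end = std::min(length, begin + chunk);
        std::memcpy(dst + begin, src + begin, end - begin);
    }
}
```

The same structure carries over to an OpenCL kernel on a CPU device: each work-item loops over its own contiguous range instead of copying a single element.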
I believe it is the cl::NDRange(1) which is telling the runtime to use single-item work groups. This is not efficient. In the C API you can pass NULL for this to leave the work group size up to the runtime; in the C++ API, passing cl::NullRange as the local size does the same. This should be faster on the CPU; it certainly will be on a GPU.

OpenMP slows down program instead of speeding it up: a bug in gcc?

I will first give some background about the problem I'm having so you know what I'm trying to do. I have been helping out with the development of a certain software tool and found out that we could benefit greatly from using OpenMP to parallelize some of the biggest loops in this software. We actually parallelized the loops successfully, and with just two cores the loops executed 30% faster, which was an OK improvement. On the other hand, we noticed a weird phenomenon in a function that traverses a tree structure using recursive calls. The program actually slowed down here with OpenMP on, and the execution time of this function more than doubled. We thought that maybe the tree structure was not balanced enough for parallelization and commented out the OpenMP pragmas in this function. This appeared to have no effect on the execution time, though. We are currently using GCC 4.4.6 with the -fopenmp flag on for OpenMP support. And here is the current problem:
If we don't use any omp pragmas in the code, all runs fine. But if we add just the following to the beginning of the program's main function, the execution time of the tree traversal function more than doubles, from 35 seconds to 75 seconds:
//beginning of main function
...
#pragma omp parallel
{
#pragma omp single
{}
}
//main function continues
...
Does anyone have any clues about why this happens? I don't understand why the program slows down so greatly just from using the OpenMP pragmas. If we take off all the omp pragmas, the execution time of the tree traversal function drops back to 35 seconds again. I would guess that this is some sort of compiler bug as I have no other explanation on my mind right now.
Not everything that can be parallelized, should be parallelized. If you are using a single, then only one thread executes it and the rest have to wait until the region is done. They can either spin-wait or sleep. Most implementations start out with a spin-wait, hoping that the single region will not take too long and the waiting threads can see the completion faster than if sleeping. Spin-waits eat up a lot of processor cycles. You can try specifying that the wait should be passive - but this is only in OpenMP V3.0 and is only a hint to the implementation (so it might not have any effect). Basically, unless you have a lot of work in the parallel region that can compensate for the single, the single is going to increase the parallel overhead substantially and may well make it too expensive to parallelize.
First, OpenMP often reduces performance on the first try. It can be tricky to use omp parallel if you don't understand it inside out. I may be able to help if you can tell me a little more about the program structure, specifically the following questions annotated by ????.
//beginning of main function
...
#pragma omp parallel
{
???? What goes here, is this a loop? if so, for loop, while loop?
#pragma omp single
{
???? What goes here, how long does it run?
}
}
//main function continues
....
???? Does performance of this code reduce or somewhere else?
Thanks.
Thank you everyone. We were able to fix the issue today by linking with TCMalloc, one of the solutions ejd offered. The execution time dropped immediately and we were able to get around 40% improvement in execution times over a non-threaded version. We used 2 cores. It seems that when using OpenMP on Unix with GCC, you should also pick a replacement for the standard memory management solution. Otherwise the program may just slow down.
I did some more testing and made a small test program to test whether the issue could be memory operation related. I was unable to replicate the issue of an empty parallel-single region causing program to slow down in my small test program, but I was able to replicate the slow down by parallelizing some malloc calls.
When running the test program on Windows 7 64-bit with 2 CPU-cores, no noticeable slow down was caused by using -fopenmp flag with the gcc (g++) compiler and running the compiled program compared to running the program without OpenMP support.
Doing the same on Kubuntu 11.04 64-bit on the same computer, however, raised the execution time to over 4 times that of the non-OpenMP version. This issue seems to appear only on Unix systems and not on Windows.
The source of my test program is below. I have also uploaded zipped-source for win and unix version as well as assembly source for win and unix version for both with and without OpenMP-support. This zip can be downloaded here http://www.2shared.com/file/0thqReHk/omp_speed_test_2011_05_11.html
#include <stdio.h>
#include <windows.h>
#include <list>
#include <sys/time.h>
//#include <cstdlib>
using namespace std;
int main(int argc, char* argv[])
{
// #pragma omp parallel
// #pragma omp single
// {}
int start = GetTickCount();
/*
struct timeval begin, end;
int usecs;
gettimeofday(&begin, NULL);
*/
list<void *> pointers;
#pragma omp parallel for default(shared)
for(int i=0; i< 10000; i++)
//pointers.push_back(calloc(20000, sizeof(void *)));
pointers.push_back(malloc(20000));
for(list<void *>::iterator i = pointers.begin(); i!= pointers.end(); i++)
free(*i);
/*
gettimeofday(&end, NULL);
if (end.tv_usec < begin.tv_usec) {
end.tv_usec += 1000000;
begin.tv_sec += 1;
}
usecs = (end.tv_sec - begin.tv_sec) * 1000000;
usecs += (end.tv_usec - begin.tv_usec);
*/
printf("It took %d milliseconds to finish the memory operations", GetTickCount() - start);
//printf("It took %d milliseconds to finish the memory operations", usecs/1000);
return 0;
}
What remains unanswered is what I can do to avoid issues like these on Unix platforms.
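One caveat about the test program above: it calls push_back on a shared std::list from inside the parallel loop without any synchronization, which is a data race, so its timings mix allocator behaviour with undefined behaviour. A race-free variant of the same allocation test might look like the sketch below, where every iteration writes to its own preallocated slot. The names allocate_blocks and free_blocks are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Allocates `count` blocks of `bytes` bytes in parallel and returns the pointers.
std::vector<void*> allocate_blocks(int count, std::size_t bytes)
{
    assert(count >= 0);
    std::vector<void*> pointers(static_cast<std::size_t>(count), nullptr);
    #pragma omp parallel for default(shared)
    for (int i = 0; i < count; i++)
        pointers[i] = std::malloc(bytes);  // each iteration writes only its own slot
    return pointers;
}

// Frees every block and empties the vector.
void free_blocks(std::vector<void*>& pointers)
{
    for (void* p : pointers)
        std::free(p);
    pointers.clear();
}
```

Timing allocate_blocks(10000, 20000) with and without -fopenmp would then isolate the cost of concurrent malloc calls, which is what the TCMalloc fix points to.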