OpenMP slows down program instead of speeding it up: a bug in gcc? - performance

I will first give some background about the problem so you know what I'm trying to do. I have been helping out with the development of a certain software tool and found that we could benefit greatly from using OpenMP to parallelize some of its biggest loops. We parallelized the loops successfully, and with just two cores the loops executed 30% faster, which was an OK improvement. On the other hand, we noticed a weird phenomenon in a function that traverses a tree structure using recursive calls. The program actually slowed down here with OpenMP on, and the execution time of this function more than doubled. We thought that maybe the tree structure was not balanced enough for parallelization, so we commented out the OpenMP pragmas in this function. This appeared to have no effect on the execution time, though. We are currently using the GCC 4.4.6 compiler with the -fopenmp flag for OpenMP support. And here is the current problem:
If we don't use any omp pragmas in the code, all runs fine. But if we add just the following to the beginning of the program's main function, the execution time of the tree traversal function more than doubles, from 35 seconds to 75 seconds:
//beginning of main function
...
#pragma omp parallel
{
#pragma omp single
{}
}
//main function continues
...
Does anyone have any clues as to why this happens? I don't understand why the program slows down so drastically just from using the OpenMP pragmas. If we remove all the omp pragmas, the execution time of the tree traversal function drops back to 35 seconds. I would guess that this is some sort of compiler bug, as I have no other explanation in mind right now.

Not everything that can be parallelized should be parallelized. If you use a single, then only one thread executes it and the rest have to wait until the region is done. They can either spin-wait or sleep. Most implementations start out with a spin-wait, hoping that the single region will not take too long and the waiting threads can see the completion faster than if sleeping. Spin-waits eat up a lot of processor cycles. You can try specifying that the wait should be passive - but this is only in OpenMP V3.0 and is only a hint to the implementation (so it might not have any effect). Basically, unless you have a lot of work in the parallel region that can compensate for the single, the single is going to increase the parallel overhead substantially and may well make it too expensive to parallelize.
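For illustration, here is a minimal sketch of the waiting behavior described above. With an OpenMP 3.0 runtime, the passive-wait hint mentioned above is exposed through the OMP_WAIT_POLICY environment variable (again, only a hint, so the runtime may ignore it):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("thread %d ran the single region\n", omp_get_thread_num());
        // Implicit barrier here: every other thread either spins or sleeps
        // until the single region completes. Try running as
        //   OMP_WAIT_POLICY=passive ./a.out
        // to request sleeping instead of spin-waiting.
    }
    return 0;
}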

First, OpenMP often reduces performance on the first try. It can be tricky to use omp parallel if you don't understand it inside-out. I may be able to help if you can tell me a little more about the program structure, specifically the following questions annotated by ????.
//beginning of main function
...
#pragma omp parallel
{
???? What goes here, is this a loop? if so, for loop, while loop?
#pragma omp single
{
???? What goes here, how long does it run?
}
}
//main function continues
....
???? Does performance of this code reduce or somewhere else?
Thanks.

Thank you everyone. We were able to fix the issue today by linking with TCMalloc, one of the solutions ejd offered. The execution time dropped immediately, and we got around a 40% improvement in execution time over the non-threaded version, using 2 cores. It seems that when using OpenMP on Unix with GCC, you should also pick a replacement for the standard memory allocator. Otherwise the program may just slow down.
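For anyone hitting the same problem, a minimal sketch of what the fix amounts to, assuming gperftools' TCMalloc is installed as libtcmalloc (library names and paths may differ on your system):

$ g++ -fopenmp program.cpp -o program -ltcmalloc

An existing binary can usually also be tested without relinking by preloading the allocator, e.g. LD_PRELOAD=/usr/lib/libtcmalloc.so ./program.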

I did some more testing and made a small test program to check whether the issue could be related to memory operations. I was unable to replicate the issue of an empty parallel-single region causing the program to slow down in my small test program, but I was able to replicate the slowdown by parallelizing some malloc calls.
When running the test program on Windows 7 64-bit with 2 CPU cores, no noticeable slowdown was caused by compiling with the -fopenmp flag in gcc (g++), compared to running the program without OpenMP support.
Doing the same on Kubuntu 11.04 64-bit on the same computer, however, raised the execution time to over 4 times that of the non-OpenMP version. This issue seems to appear only on Unix systems and not on Windows.
The source of my test program is below. I have also uploaded the zipped source for the Windows and Unix versions, as well as the assembly source for both, with and without OpenMP support. This zip can be downloaded here: http://www.2shared.com/file/0thqReHk/omp_speed_test_2011_05_11.html
#include <stdio.h>
#include <windows.h>   // GetTickCount (Windows timing)
#include <list>
#include <sys/time.h>  // gettimeofday (Unix timing, see commented blocks)
#include <cstdlib>     // malloc/free
using namespace std;

int main(int argc, char* argv[])
{
//  #pragma omp parallel
//  #pragma omp single
//  {}
    int start = GetTickCount();
    /*
    struct timeval begin, end;
    int usecs;
    gettimeofday(&begin, NULL);
    */
    list<void *> pointers;
    #pragma omp parallel for default(shared)
    for (int i = 0; i < 10000; i++)
    {
        //void *p = calloc(20000, sizeof(void *));
        void *p = malloc(20000);
        // std::list::push_back is not thread-safe, so the insertion is
        // guarded; the malloc calls themselves still run in parallel.
        #pragma omp critical
        pointers.push_back(p);
    }
    for (list<void *>::iterator i = pointers.begin(); i != pointers.end(); i++)
        free(*i);
    /*
    gettimeofday(&end, NULL);
    if (end.tv_usec < begin.tv_usec) {
        end.tv_usec += 1000000;
        begin.tv_sec += 1;
    }
    usecs = (end.tv_sec - begin.tv_sec) * 1000000;
    usecs += (end.tv_usec - begin.tv_usec);
    */
    printf("It took %d milliseconds to finish the memory operations", GetTickCount() - start);
    //printf("It took %d milliseconds to finish the memory operations", usecs/1000);
    return 0;
}
What remains unanswered now is: what can I do to avoid issues like these on the Unix platform?

Related

Running time scales with the number of threads when running a function received from Python inside OpenMP parallel block

Here are the files for the test.
# CMakeLists.txt
cmake_minimum_required(VERSION 3.16)
project(CALLBACK_TEST)
set(CMAKE_CXX_STANDARD 17)
add_compile_options(-O3 -fopenmp -fPIC)
add_link_options(-fopenmp)
add_subdirectory(pybind11)
pybind11_add_module(callback callback.cpp)
add_custom_command(TARGET callback POST_BUILD
COMMAND ${CMAKE_COMMAND} -E create_symlink $<TARGET_FILE:callback> ${CMAKE_CURRENT_SOURCE_DIR}/callback.so
)
// callback.cpp
#include <cmath>
#include <functional>
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
namespace py = pybind11;

class C
{
public:
    // Members are initialized in declaration order (v_, then f_), so the
    // initializer list is written in that order to avoid -Wreorder warnings.
    C(std::function<float(float)> f, size_t s) : v_(s, 1), f_(f) {}
    void apply()
    {
        #pragma omp parallel for
        for (size_t i = 0; i < v_.size(); i++)
            v_[i] = f_(v_[i]);
    }
    void apply_direct()
    {
        #pragma omp parallel for
        for (size_t i = 0; i < v_.size(); i++)
            v_[i] = log(1 + v_[i]);
    }
private:
    std::vector<float> v_;
    std::function<float(float)> f_;
};

PYBIND11_MODULE(callback, m)
{
    py::class_<C>(m, "C")
        .def(py::init<std::function<float(float)>, size_t>())
        .def("apply", &C::apply, py::call_guard<py::gil_scoped_release>())
        .def("apply_direct", &C::apply_direct);
    m.def("log1p", [](float x) -> float
          { return log(1 + x); });
}
# callback.py
import math
import time
from callback import C, log1p

def run(n, func):
    start = time.time()
    if func:
        for _ in range(n):
            c = C(func, 1000)
            c.apply()
    else:
        for _ in range(n):
            c = C(func, 1000)
            c.apply_direct()
    end = time.time()
    print(end - start)

if __name__ == "__main__":
    n = 1000
    one = 1
    print("Python")
    run(n, lambda x: math.log(x + 1))
    print("C++")
    run(n, log1p)
    print("Direct")
    run(n, None)
I run the Python script on a server with 48 CPU cores. Here are the running times. They show that 1. the running time increases as OMP_NUM_THREADS increases, especially when a Python/C++ callback is passed in from Python, and 2. keeping everything inside C++ is much faster, which seems to contradict the "no overhead" claim in the documentation.
$ python callback.py
Python
19.612852573394775
C++
19.268250226974487
Direct
0.04382634162902832
$ OMP_NUM_THREADS=4 python callback.py
Python
6.042902708053589
C++
5.48648738861084
Direct
0.03322458267211914
$ OMP_NUM_THREADS=1 python callback.py
Python
0.5964927673339844
C++
0.38849639892578125
Direct
0.020793914794921875
And when OpenMP is turned off:
$ python callback.py
Python
0.8492450714111328
C++
0.26660943031311035
Direct
0.010872125625610352
So what goes wrong here?
There are several issues in your code.
First of all, the OpenMP parallel region has significant overhead here, since the work needs to be shared between 48 threads. This work-sharing can be quite expensive on some platforms, depending on the scheduling policy. You need to use schedule(static) to minimize this overhead. In the worst case, a runtime could create 48 threads and join them every time, which is expensive; creating and joining 48*1000 threads would be very expensive (it should take at least several seconds), and the higher the number of threads, the slower the program. That being said, most runtimes try to keep an active pool of threads. Still, this is not always possible (and it is an optimization, not something required by the specification). Note that most OpenMP runtimes detect the case where OMP_NUM_THREADS is set to 1, so as to have very low overhead in that case. The general rule of thumb is to avoid using multithreading for very short operations, like ones taking less than 1 ms.
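For instance, a minimal sketch of that scheduling fix, applied to the apply_direct loop from the question:

// Sketch: schedule(static) splits the iteration space into fixed contiguous
// blocks computed once, instead of threads repeatedly requesting work.
#pragma omp parallel for schedule(static)
for (size_t i = 0; i < v_.size(); i++)
    v_[i] = log(1 + v_[i]);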
Moreover, the parallel for loop is subject to false sharing. Indeed, the vector of 1000 float items takes 4000 bytes in memory, which spans about 63 cache lines of 64 bytes on mainstream platforms. With 48 threads, almost all of those cache lines have to move between cores, which is expensive compared to the computation done. When two threads working on adjacent cache lines have an interleaved execution, a cache line can bounce many times for just a few iterations. On NUMA architectures this is even more expensive, since cache lines have to move between NUMA nodes. Doing this 1000 times is very expensive.
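One common mitigation, sketched below, is to hand each thread chunks that cover whole cache lines (16 floats = 64 bytes), so that no two threads write to the same line (this assumes the vector's data happens to be 64-byte aligned, which is not guaranteed):

// Sketch: each 16-float chunk maps onto one 64-byte cache line, so threads
// do not write into lines owned by other threads.
#pragma omp parallel for schedule(static, 16)
for (size_t i = 0; i < v_.size(); i++)
    v_[i] = log(1 + v_[i]);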
Additionally, AFAIK, calling a Python function from a parallel context is either not safe or gets no speed-up because of the global interpreter lock (GIL). By not safe, I mean that the CPython interpreter's data structures can be corrupted, causing non-deterministic crashes. This is why the GIL exists. The GIL prevents any code from scaling across multiple threads as long as it is not released. Releasing the GIL for too short a period also causes cache-line bouncing, which is detrimental to performance (more so than using sequential code).
Finally, the "C++" and "Python" cases have much bigger overhead than the "direct" method because they call dynamically-defined functions that cannot be inlined or vectorized by the compiler. Python functions are especially slow because of the CPython interpreter. If you want to make a fair benchmark, you need to compare the PyBind solution with one that uses std::function (and be careful about clever compiler optimizations).
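To make that last point concrete, here is a hypothetical single-threaded benchmark sketch: the same loop body called through an opaque std::function (no inlining) versus written directly (inlinable, vectorizable). The names and sizes mirror the question, not any real API, and an aggressive optimizer may distort the comparison:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

int main()
{
    std::vector<float> v(1000, 1.0f);
    std::function<float(float)> f = [](float x) { return std::log(1 + x); };

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < 1000; r++)
        for (size_t i = 0; i < v.size(); i++)
            v[i] = f(v[i]);               // indirect call each iteration
    auto t1 = std::chrono::steady_clock::now();
    for (int r = 0; r < 1000; r++)
        for (size_t i = 0; i < v.size(); i++)
            v[i] = std::log(1 + v[i]);    // direct, inlinable call
    auto t2 = std::chrono::steady_clock::now();

    printf("std::function: %ld us, direct: %ld us\n",
           (long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    return 0;
}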

OpenMP for-loop chunk scheduling visualization

Are there tools that visualize the execution of OpenMP for-loop chunks?
For example, consider the parallel for-loop below:
#pragma omp parallel for schedule(dynamic, 10) num_threads(4)
for (int i = 1; i < 100; i++)
{
    // do work of uneven execution time.
}
I want to visualize which thread each of the ten chunks (say (1,10), (11,20), ..., (91,100)) executed on and how long it took, without modifying the code.
I understand that only four parallel outline functions (one per thread) are started, and that each of these functions asks for chunks in a synchronized manner. I can visualize the four parallel outline functions in tools such as Intel VTune, but I am unable to drill this visualization down to the chunk level.
Thanks in advance for your tips and suggestions!
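Not an answer to the "without modifying code" constraint, but if light instrumentation were acceptable after all, a sketch like the following can at least record chunk ownership (the chunk-start test relies on the chunk size of 10 from the pragma):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel for schedule(dynamic, 10) num_threads(4)
    for (int i = 1; i < 100; i++)
    {
        // (i - 1) % 10 == 0 marks the first iteration of each 10-wide chunk
        if ((i - 1) % 10 == 0)
            printf("chunk starting at %d runs on thread %d\n",
                   i, omp_get_thread_num());
        // ... work of uneven execution time ...
    }
    return 0;
}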

different OpenMP output in different machine

When I try to run the following code on my CentOS system running in a VM, I get the right output, but when I run the same code on the compact supercomputer "Param Shavak", I get incorrect output.... :(
#include <stdio.h>
#include <omp.h>
int main()
{
    int p = 1, s = 1, tid;
    #pragma omp parallel private(p, tid) shared(s)
    {
        p = 1;
        tid = omp_get_thread_num();
        p = p + tid;
        s = s + tid;   // race condition: unsynchronized update of shared s
        printf("Thread %d P=%d S=%d\n", tid, p, s);
    }
    return 0;
}
If your program runs correctly on one machine, it must be because it's not actually running in parallel on that machine.
Your program suffers from a race condition in the s=s+tid; line of code. s is a shared variable, so several threads try to update it at the same time, which results in lost updates.
You can fix the problem by turning that line of code into an atomic operation:
#pragma omp atomic
s=s+tid;
That way only one thread at a time can read and update the variable s, and the race condition is no more.
In more complex programs you should use atomic operations or critical regions only when necessary, because you don't have parallelism in those regions and that hurts performance.
EDIT: As suggested by user High Performance Mark, I must remark that the program above is very inefficient because of the atomic operation. The proper way to do that kind of calculation (adding to the same variable in all iterations of a loop) is to implement a reduction. OpenMP makes it easy by using the reduction clause:
#pragma omp parallel reduction(operator : variables)
Try this version of your program, using reduction:
#include <stdio.h>
#include <omp.h>
int main()
{
    int p = 1, s = 1, tid;
    #pragma omp parallel reduction(+:s) private(p, tid)
    {
        p = 1;
        tid = omp_get_thread_num();
        p = p + tid;
        s = s + tid;   // each thread updates its own private copy of s;
                       // the copies are combined with + at the end of the region
        printf("Thread %d P=%d S=%d\n", tid, p, s);
    }
    return 0;
}
The following link explains critical sections, atomic operations and reduction in a more verbose way: http://www.lindonslog.com/programming/openmp/openmp-tutorial-critical-atomic-and-reduction/

Fewer threads updating value than expected with omp_set_num_threads()

Why does this program print 64 as the result and not 5000? If the count variable is updated in a critical section, I would expect only one thread to have access to it at any given time. Each thread should therefore be able to increment count, producing the result 5000, so why do I get 64 instead?
#include <iostream>
#include <cstdlib>   // system
#include <omp.h>
using namespace std;
int main()
{
    int count = 0;
    omp_set_num_threads(5000);
    #pragma omp parallel
    {
        #pragma omp critical
        {
            count++;
        }
    }
    cout << "count = " << count << endl;
    system("pause");   // Windows-only pause
    return 0;
}
As Michael Dussere points out, you're getting 64 as an answer because your implementation is only launching 64 threads. It may be using an internal default value to limit the max number of threads (try varying the environment variable OMP_THREAD_LIMIT, or calling omp_get_thread_limit(), to see if that is the case).
The reason for such a limit is that creating threads requires resources - each thread has to have its own stack space, process table entries on Linux, etc. These aren't lightweight stateless Erlang threads scheduled in user space. On my 8-core system using gcc or icpc, setting the thread number to anything 1024 or above simply fails due to lack of resources, although adjusting system parameters can shift that limitation around.
Between the resources required by the threads and the fact that most single-image systems have significantly fewer than 5000 cores, it's not clear what you'd be able to accomplish with 5000 threads on most systems.
The value you can set with omp_set_num_threads is not unlimited.
It depends on the OpenMP implementation you use, the number of cores in your computer, and so on.
You get 64 because there are 64 threads in the current thread team. You can check with omp_get_num_threads.
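A small sketch to see what the runtime actually grants (omp_get_thread_limit requires OpenMP 3.0):

#include <iostream>
#include <omp.h>

int main()
{
    omp_set_num_threads(5000);  // a request, not a guarantee
    #pragma omp parallel
    {
        #pragma omp single
        std::cout << "team size: " << omp_get_num_threads()
                  << ", thread limit: " << omp_get_thread_limit() << std::endl;
    }
    return 0;
}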

Mathlink memory usage accumulation

I use MathLink to send and receive independent mma expressions from a C++ application as strings.
std::string expression[N];
// ...
for (int i = 0; i < N; ++i) {
    MLPutFunction(l, "EnterTextPacket", 1);
    MLPutString(l, expression[i].c_str());
    MLEndPacket(l);
    // Check Packet ...
    const char* result;
    MLGetString(l, &result);
    // process result ...
    MLDisownString(l, result);
}
I would expect MLDisownString to free the used memory, except that it doesn't.
Any ideas?
Ok. Posting this as an answer, because I believe the odds you are using version 5 or below are pretty low:
`As of Version 6.0, MLDisownString() has been superseded by MLReleaseString()`
Check it here
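Assuming a version 6 or later MathLink, the receiving part of your loop would then become (a sketch using the same variables as in your code):

const char* result;
if (MLGetString(l, &result)) {
    // process result ...
    MLReleaseString(l, result);  // v6+ replacement for MLDisownString
}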
First of all, I should point out the $HistoryLength parameter. Setting it to zero often reduces memory requirements considerably:
$HistoryLength = 0
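Since you drive the kernel over MathLink, a sketch of setting this from the C++ side at session start, reusing the EnterTextPacket pattern from your question:

// Hypothetical: sent once after connecting, before the main loop.
MLPutFunction(l, "EnterTextPacket", 1);
MLPutString(l, "$HistoryLength = 0");  // kernel keeps no Out[n] history
MLEndPacket(l);
// ... read and discard the reply packet as in the main loop ...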
At the same time, it is a known problem that the MathKernel process accumulates system memory during long computations and does not release it.
The only way to ultimately solve the problem is to restart the kernel when it takes too much memory or when the amount of available free physical memory becomes too small. This task can be automated.
If you have not tried Mathematica 8 yet, it may be worth a try, since, according to Oliver Ruebenkoenig:
For version 8 the memory allocator has been rewritten and improved. (What a small sentence for such a huge endeavor and such a fine execution)
But I have not tried version 8 yet and cannot say anything about it.
