Shared address space in Cython - openmp

I have the following Cython code:
from cython import parallel
from libc.stdio cimport printf

cdef extern from "unistd.h" nogil:
    int usleep(int);

def test_func():
    cdef int var = -1
    with nogil, parallel.parallel(num_threads=4):
        var = parallel.threadid()
        printf("Var: %d\n", var)
After compiling and launching in python3 console I get:
>>> import test
>>> test.test_func()
Var: 3
Var: 0
Var: 2
Var: 1
That means each thread gets its own private copy of var.
However, I would like to have the C++ OpenMP behavior, duplicating this:
#include "omp.h"
#include <stdio.h>
#include <unistd.h>
int main() {
    int v=-1;
    #pragma omp parallel
    {
        v = omp_get_thread_num();
        printf("Var: %d\n", v);
    }
}
which outputs something like this:
Var: 3
Var: 3
Var: 3
Var: 3
So my question is: is it possible to share a variable between the threads of a Cython parallel block?

It's a bit of a hacky way of doing it, but by using a pointer to var you seem to be able to trick it:
def test_func2():
    cdef int var = -1
    cdef int* var_ptr = &var
    with nogil, parallel.parallel(num_threads=4):
        var_ptr[0] = parallel.threadid()
        printf("Var: %d\n", var)
If you look at the generated C code then test_func (your version) gives the line
#pragma omp parallel private(__pyx_v_var) num_threads(4)
while test_func2 gives
#pragma omp parallel num_threads(4)
I think this works because you don't directly assign to var, so it isn't picked up by Cython's normal rules for what to make private. There is a risk here: if a future version of Cython got cleverer and made var_ptr private, then it wouldn't be initialised at the start of the parallel section (so be careful and check what it's doing!).
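For comparison, the private/shared distinction that Cython infers can be written out explicitly in C++ OpenMP. This is only a minimal sketch, assuming a compiler with OpenMP enabled (for example via -fopenmp); the function names write_shared and write_private are made up for illustration.

```cpp
// Hypothetical helper: all threads write to the single shared `var`.
int write_shared() {
    int var = -1;
    #pragma omp parallel num_threads(4) shared(var)
    {
        // An atomic write avoids a data race on the shared variable.
        #pragma omp atomic write
        var = 7;
    }
    return var;  // the write is visible after the parallel region
}

// Hypothetical helper: with private(var) each thread writes to its own copy.
int write_private() {
    int var = -1;
    #pragma omp parallel num_threads(4) private(var)
    {
        var = 7;  // per-thread copy, discarded when the region ends
    }
    // With OpenMP enabled the original variable is normally left untouched.
    return var;
}
```

The first function corresponds to what the pointer trick achieves in Cython, the second to the private(__pyx_v_var) clause that Cython generates for test_func.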


How to use gtest in an OpenMP parallel block?

I want to use gtest with OpenMP.
I want to calculate something in a parallel block, check the result held in a thread-private variable, and then do more calculation.
Here is an example.
#include "gmock.h"
#include <iostream>
#include <omp.h>
using namespace testing;

TEST(SimpleGtest, OpenMP)
{
    #pragma omp parallel
    {
        // some thread private variable
        int thread_index = omp_get_thread_num();
        int z;
        // some calculation
        // ...
        // check result of thread private variable
        ASSERT_THAT(z, Eq(13));
        // other calculation
        // ...
    }
}

int main(int argc, char **argv)
{
    testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}
But when I compile the code, the compiler complains:
error: "return" branches to or from an OpenMP structured block are illegal
ASSERT_THAT(z, Eq(13));
The code above is just a simple example.
I know we could use a shared variable instead of a thread-private one, so that the assert can be done outside the parallel block.
But is there any way to use gtest to check a thread-private variable's result inside an OpenMP parallel block?
Thanks in advance!
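Do not use ASSERT_THAT in a parallel block: on failure the ASSERT_* macros return from the enclosing function, which is exactly what the compiler rejects. The EXPECT_* macros (for example EXPECT_EQ or EXPECT_THAT) record the failure without returning, so they can help here.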
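The idea of checking without leaving the block can be sketched in plain C++, without gtest. This is a rough illustration under assumptions: compute_z is a hypothetical stand-in for the per-thread calculation, and count_mismatches is a made-up helper that counts failed checks with an OpenMP reduction instead of returning early.

```cpp
// Hypothetical stand-in for the per-thread calculation in the question.
int compute_z(int index) {
    (void)index;  // unused in this toy version
    return 13;
}

// Checks every thread-private result inside the parallel loop and
// returns how many of them differ from `expected`.
int count_mismatches(int n, int expected) {
    int failures = 0;
    #pragma omp parallel for reduction(+:failures)
    for (int i = 0; i < n; ++i) {
        int z = compute_z(i);  // thread-private value
        if (z != expected) {
            ++failures;        // record the failure, do not return
        }
        // other calculation could continue here
    }
    return failures;
}
```

A single check such as EXPECT_EQ(count_mismatches(8, 13), 0) can then sit outside the parallel region, where ASSERT_* macros are also legal again.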

Pybind11: Possible to use mpi4py?

Is it possible in Pybind11 to use mpi4py on the Python side and then to hand over the communicator to the C++ side?
If so, how would it work?
If not, is it possible for example with Boost? And if so, how would it be done?
I searched the web literally for hours but didn't find anything.
Passing an mpi4py communicator to C++ using pybind11 can be done with the
mpi4py C-API. The corresponding header files can be located using the
following Python code:
import mpi4py
print(mpi4py.get_include())
To conveniently pass communicators between Python and C++, a custom pybind11
type caster can be implemented. For this purpose, we start with the typical
preamble.
// native.cpp
#include <pybind11/pybind11.h>
#include <mpi.h>
#include <mpi4py/mpi4py.h>
namespace py = pybind11;
In order for pybind11 to automatically convert a Python type to a C++ type,
we need a distinct type that the C++ compiler can recognise. Unfortunately,
the MPI standard does not specify the type for MPI_Comm. Worse, in
common MPI implementations MPI_Comm can be defined as int or void*,
which the C++ compiler cannot distinguish from regular use of these types.
To create a distinct type, we define a wrapper class for MPI_Comm which
implicitly converts to and from MPI_Comm.
struct mpi4py_comm {
  mpi4py_comm() = default;
  mpi4py_comm(MPI_Comm value) : value(value) {}
  operator MPI_Comm () { return value; }

  MPI_Comm value;
};
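Why the wrapper is needed can be shown without MPI at all. The following is a sketch under assumptions: fake_comm is a hypothetical stand-in for an MPI implementation that defines MPI_Comm as int, and fake_comm_wrapper mirrors the shape of the wrapper above.

```cpp
#include <type_traits>

// Hypothetical stand-in for an MPI_Comm that is defined as a plain int.
typedef int fake_comm;

// Same shape as the wrapper above, but around the stand-in type.
struct fake_comm_wrapper {
    fake_comm_wrapper() = default;
    fake_comm_wrapper(fake_comm value) : value(value) {}
    operator fake_comm() const { return value; }

    fake_comm value = 0;
};

// The typedef is only another name for int, so a template specialisation
// for it would also be the specialisation for every ordinary int.
static_assert(std::is_same<fake_comm, int>::value, "alias, not a new type");

// The wrapper is a distinct type that a specialisation can target.
static_assert(!std::is_same<fake_comm_wrapper, int>::value, "distinct type");

// Hypothetical function that expects the raw handle.
int raw_handle_plus_one(fake_comm comm) { return comm + 1; }
```

Thanks to the implicit conversions, a fake_comm_wrapper can be passed straight to raw_handle_plus_one, in the same way that mpi4py_comm can be passed to functions expecting MPI_Comm.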
The type caster is then implemented as follows:
namespace pybind11 { namespace detail {
  template <> struct type_caster<mpi4py_comm> {
      PYBIND11_TYPE_CASTER(mpi4py_comm, _("mpi4py_comm"));

      // Python -> C++
      bool load(handle src, bool) {
        PyObject *py_src = src.ptr();
        // Check that we have been passed an mpi4py communicator
        if (PyObject_TypeCheck(py_src, &PyMPIComm_Type)) {
          // Convert to regular MPI communicator
          value.value = *PyMPIComm_Get(py_src);
        } else {
          return false;
        }
        return !PyErr_Occurred();
      }

      // C++ -> Python
      static handle cast(mpi4py_comm src,
                         return_value_policy /* policy */,
                         handle /* parent */)
      {
        // Create an mpi4py handle
        return PyMPIComm_New(src.value);
      }
  };
}} // namespace pybind11::detail
Below is the code of an example module which uses the type caster. Note that
we use mpi4py_comm instead of MPI_Comm in the function definitions
exposed to pybind11. However, due to the implicit conversion, we can use
these variables as regular MPI_Comm variables. In particular, they can be
passed to any function expecting an argument of type MPI_Comm.
#include <iostream>
#include <stdexcept>

// receive a communicator and check if it equals MPI_COMM_WORLD
void print_comm(mpi4py_comm comm)
{
  if (comm == MPI_COMM_WORLD) {
    std::cout << "Received the world." << std::endl;
  } else {
    std::cout << "Received something else." << std::endl;
  }
}

mpi4py_comm get_comm()
{
  return MPI_COMM_WORLD; // Just return MPI_COMM_WORLD for demonstration
}

PYBIND11_MODULE(native, m)
{
  // import the mpi4py API
  if (import_mpi4py() < 0) {
    throw std::runtime_error("Could not load mpi4py API.");
  }

  // register the test functions
  m.def("print_comm", &print_comm, "Do something with the mpi4py communicator.");
  m.def("get_comm", &get_comm, "Return some communicator.");
}
The module can be compiled, e.g., using
mpicxx -O3 -Wall -shared -std=c++14 -fPIC \
$(python3 -m pybind11 --includes) \
-I$(python3 -c 'import mpi4py; print(mpi4py.get_include())') \
native.cpp -o native$(python3-config --extension-suffix)
and tested using
import native
from mpi4py import MPI
import math

native.print_comm(MPI.COMM_WORLD)

# Create a cart communicator for testing
# (MPI_COMM_WORLD.size has to be a square number)
d = int(math.sqrt(MPI.COMM_WORLD.size))
cart_comm = MPI.COMM_WORLD.Create_cart([d,d], [1,1], False)
native.print_comm(cart_comm)

print(f'native.get_comm() == MPI.COMM_WORLD '
      f'-> {native.get_comm() == MPI.COMM_WORLD}')
The output should be:
Received the world.
Received something else.
native.get_comm() == MPI.COMM_WORLD -> True
This is indeed possible. As was pointed out in the comments by John Zwinck, MPI_COMM_WORLD will automatically point to the correct communicator, so nothing has to be passed from Python to the C++ side.
First we have a simple pybind11 module that exposes a single function which simply prints some MPI information (taken from one of the many online tutorials). To compile the module, see here: pybind11 cmake example.
#include <pybind11/pybind11.h>
#include <mpi.h>
#include <stdio.h>

void say_hi()
{
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);
}

PYBIND11_MODULE(mpi_lib, pybind_module)
{
    constexpr auto MODULE_DESCRIPTION = "Just testing out mpi with python.";
    pybind_module.doc() = MODULE_DESCRIPTION;
    pybind_module.def("say_hi", &say_hi, "Each process is allowed to say hi");
}
Next the Python side. Here I reuse the example from this post: Hiding MPI in Python and simply put in the pybind11 library. So first the Python script that will call the MPI Python script:
import sys
import numpy as np
from mpi4py import MPI

def parallel_fun():
    comm = MPI.COMM_SELF.Spawn(
        sys.executable,
        args = [''],
        maxprocs=4)
    N = np.array(0, dtype='i')
    comm.Reduce(None, [N, MPI.INT], op=MPI.SUM, root=MPI.ROOT)
    print(f'We got the magic number {N}')
And the child process file. Here we simply call the library function and it just works.
from mpi4py import MPI
import numpy as np
from mpi_lib import say_hi

comm = MPI.Comm.Get_parent()
N = np.array(comm.Get_rank(), dtype='i')
say_hi()
comm.Reduce([N, MPI.INT], None, op=MPI.SUM, root=0)
The end result is:
from prog import parallel_fun
parallel_fun()
# Hello world from processor arch_zero, rank 1 out of 4 processors
# Hello world from processor arch_zero, rank 2 out of 4 processors
# Hello world from processor arch_zero, rank 0 out of 4 processors
# Hello world from processor arch_zero, rank 3 out of 4 processors
# We got the magic number 6

Device not available error when running code on Intel MIC

When I try to run my code on Intel MIC it is giving an error like
"offload error: cannot offload to MIC - device is not available"
My sample code is
#include <stdio.h>
#include <omp.h>
int main()
{
    int N=10;
    int i, a[N];
    #pragma offload target(mic)
    #pragma omp parallel
    #pragma omp for
    for(i = 0; i < N; i++)
        printf("a[%d] :: %d \n", i, a[i]);
    return 0;
}
One of 2 things is happening. Either the card is not booted, you can check this by:
sudo micctrl -s
Or the runtime cannot find dependent libraries. This is most likely due to not sourcing the compiler environment variables:
source /opt/intel/composerxe/bin/ intel64
I believe you have not set up the compiler's environment.
Compiler Environment:
source /opt/intel/composerxe/bin/ intel64
Also include the offload library header:
#include "offload.h"

Race conditions with OpenMP

I need to fill a 2D array (tmp[Ny][Nx]) in which each cell gets the integral of some function, evaluated for a given set of free parameters. Since I deal with very large arrays (here I simplified my case), I need to use OpenMP parallelism to speed up my calculations. Here I use a simple #pragma omp parallel for directive.
Without #pragma omp parallel for, the code executes perfectly. But adding the parallel directive produces race conditions in the output.
I tried to cure it by making the variables private(i,j,par), but it did not help.
P.S. I use VS2008 Professional with OpenMP 2.0 under Windows 7.
Here is my code (a short sample):
testfunc(const double* var, const double* par)
{
    // here is some simple function to be integrated over
    // var[0] and var[1] and two free parameters par[0] and par[1]
    return ....
}

#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];

int main()
{
    double par[2]; // parameters
    double xmin[]={0,0}; // limits of 2D integration
    double xmax[]={1,1}; // limits of 2D integration
    double val,Tol=1e-7,AbsTol=1e-7;
    int i,j,NDim=2,NEval=1e5;
    #pragma omp parallel for private(i,j,par,val)
    for (i=0;i<Nx;i++)
    {
        for (j=0;j<Ny;j++)
        {
            adapt_integrate(testfunc,par, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            // adapt_integrate - receives my integrand, performs
            // integration and returns a result through "val"
            tmp[i][j] = val;
        }
    }
}
It produces race conditions at the output. I tried to avoid it by making all internal variables (i,j,par and val) private, but it doesn't help.
P.S. Serial version (#threads=1) of this code runs properly.
(Answered in the question. Converted to a community wiki answer. See Question with no answers, but issue solved in the comments (or extended in chat) )
The OP wrote:
The problem Solved!
I defined the parameters of integration as global and used the #pragma omp threadprivate(parGlob) directive for them. Now it works like a charm. I had thought that private() and threadprivate() mean the same thing and only differ in implementation, but they do not.
So, playing with these directives may give a correct answer. Another thing is that declaring the iterator i inside the first for loop gives an additional 20%-30% speed-up. So, the fastest version of the code now looks like this:
testfunc(const double* var, const double* par)
{
    // ...
}

#define Nx 10000
#define Ny 10000
static double tmp[Ny][Nx];
double parGlob[2]; //<- Here are they!!!
#pragma omp threadprivate(parGlob) // <-Magic directive!!!!

int main()
{
    // Not here !!!! -> double par[2]; // parameters
    double xmin[]={0,0}; // limits of 2D integration
    double xmax[]={1,1}; // limits of 2D integration
    double val,Tol=1e-7,AbsTol=1e-7;
    int j,NDim=2,NEval=1e5;
    #pragma omp parallel for private(j,val) // no `i` inside `private` clause
    for (int i=0;i<Nx;i++)
    {
        for (j=0;j<Ny;j++)
        {
            adapt_integrate(testfunc,parGlob, NDim, xmin, xmax,
                            NEval, Tol, AbsTol, &val, &err);
            tmp[i][j] = val;
        }
    }
}
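One way the two directives can differ is sketched below. The sample above does not show how the integrand reaches its parameters, so this only illustrates the mechanism under assumptions: par_glob, integrand_value and fill_row are made-up names, and the integrand is assumed to read a global. A private clause only covers references written inside the construct itself, whereas threadprivate gives each thread its own copy of the global, which functions called from the loop also see.

```cpp
#include <vector>

// Hypothetical global parameter block; threadprivate gives one copy per thread.
static double par_glob[2];
#pragma omp threadprivate(par_glob)

// Hypothetical integrand that reads the global parameters,
// the way a callback without a parameter argument would.
double integrand_value() {
    return par_glob[0] + 10.0 * par_glob[1];
}

// Fills one row; every iteration sets the parameters and then evaluates.
std::vector<double> fill_row(int n) {
    std::vector<double> row(n);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // Each thread writes its own copy, so a concurrent iteration
        // on another thread cannot overwrite these values.
        par_glob[0] = i;
        par_glob[1] = 1.0;
        row[i] = integrand_value();
    }
    return row;
}
```

If par_glob were an ordinary shared global, two threads could interleave between setting the parameters and calling integrand_value, which is the kind of race described in the question.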

How to find the address & length of a C++ function at runtime (MinGW)

As this is my first post to stackoverflow I want to thank you all for your valuable posts that helped me a lot in the past.
I use MinGW (gcc 4.4.0) on Windows-7(64) - more specifically I use Nokia Qt + MinGW but Qt is not involved in my Question.
I need to find the address and -more important- the length of specific functions of my application at runtime, in order to encode/decode these functions and implement a software protection system.
I already found a way to compute the length of a function, by assuming that static functions placed one after another in a source file are also placed sequentially in the compiled object file and subsequently in memory.
Unfortunately this is true only if the whole CPP file is compiled with option: "g++ -O0" (optimization level = 0).
If I compile it with "g++ -O2" (which is the default for my project) the compiler seems to relocate some of the functions and as a result the computed function length seems to be both incorrect and negative(!).
This is happening even if I put a "#pragma GCC optimize 0" line in the source file,
which is supposed to be the equivalent of a "g++ -O0" command line option.
I suppose that "g++ -O2" instructs the compiler to perform some global file-level optimization (some function relocation?) which is not avoided by using the #pragma directive.
Do you have any idea how to prevent this, without having to compile the whole file with -O0 option?
OR: Do you know of any other method to find the length of a function at runtime?
I prepared a small example for you, with the results for different compilation options, to highlight the case.
The Source:
// ===================================================================
// test.cpp
// Intention: To find the addr and length of a function at runtime
// Problem: The application output is correct when compiled with: "g++ -O0"
// but it's erroneous when compiled with "g++ -O2"
// (although a directive "#pragma GCC optimize 0" is present)
// ===================================================================
#include <stdio.h>
#include <math.h>
#pragma GCC optimize 0

static int test_01(int p1)
{
    return 1;
}
static int test_02(int p1)
{
    return 2;
}
static int test_03(int p1)
{
    return 3;
}
static int test_04(int p1)
{
    return 4;
}

// Print a HexDump of a specific address and length
void HexDump(void *startAddr, long len)
{
    unsigned char *buf = (unsigned char *)startAddr;
    printf("addr:%ld, len:%ld\n", (long )startAddr, len);
    len = (long )fabs(len);
    while (len)
    {
        printf("%02x.", *buf);
        buf++;
        len--;
    }
}

int main(int argc, char *argv[])
{
    long fun_len = (long )test_02 - (long )test_01;
    HexDump((void *)test_01, fun_len);
    fun_len = (long )test_03 - (long )test_02;
    HexDump((void *)test_02, fun_len);
    fun_len = (long )test_04 - (long )test_03;
    HexDump((void *)test_03, fun_len);
    printf("Test End\n");
    // Just a trick to block optimizer from eliminating test_xx() functions as unused
    if (argc > 1)
    {
        test_01(1);
        test_02(2);
        test_03(3);
        test_04(4);
    }
}
The (correct) Output when compiled with "g++ -O0":
[note the 'c3' byte (= assembly 'ret') at the end of all functions]
addr:4199344, len:37
addr:4199381, len:49
addr:4199430, len:37
Test End
The erroneous Output when compiled with "g++ -O2":
(a) function test_01 addr & len seem correct
(b) functions test_02, test_03 have negative lengths,
and fun. test_02 length is also incorrect.
addr:4199416, len:36
addr:4199452, len:-72
addr:4199380, len:-36
Test End
This is happening even if I put a "#pragma GCC optimize 0" line in the source file, which is supposed to be the equivalent of a "g++ -O0" command line option.
I don't believe this is true: it is supposed to be the equivalent of attaching __attribute__((optimize(0))) to subsequently defined functions, which causes those functions to be compiled with a different optimisation level. But this does not affect what goes on at the top level, whereas the command line option does.
If you really must do horrible things that rely on top level ordering, try the -fno-toplevel-reorder option. And I suspect that it would be a good idea to add __attribute__((noinline)) to the functions in question as well.
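The negative lengths in the -O2 output come from subtracting addresses in source order while the functions sit in a different order in memory. A minimal sketch of one workaround, assuming the functions of interest end up next to each other with nothing in between: sort the addresses first and take the gap to the next higher address. The helper name gaps_between is made up for illustration, and each gap is only an upper bound on a function's size, since it includes any alignment padding.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: takes function start addresses in any order and
// returns the distance from each address to the next higher one.
// The highest address has no successor, so it gets no entry.
std::vector<std::uintptr_t> gaps_between(std::vector<std::uintptr_t> addrs) {
    std::sort(addrs.begin(), addrs.end());
    std::vector<std::uintptr_t> gaps;
    for (std::size_t i = 1; i < addrs.size(); ++i) {
        gaps.push_back(addrs[i] - addrs[i - 1]);
    }
    return gaps;
}
```

Fed with the three addresses from the -O2 run in the question (4199416, 4199452, 4199380), this yields two gaps of 36 bytes each. In a real program the addresses would come from casts such as reinterpret_cast<std::uintptr_t>(&test_01).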
