Group mpi process based on a flag - parallel-processing

I need to create a sub-communicator (mi_comm_world_2) based on a bigger mpi communicator (mpi_comm_world).
In particular, after a detection loop such as
if something exists inside the proc proc
I need to collect in the new communicator mpi_comm_world_2 all the process flagged as true in respect of the check.
I am not able to find a clear documentation in order to do this job.

Good question!
This is what the MPI_Comm_split command is good for.
Definition
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
Arguments
comm: Handle to the communicator you want to want build the new communicator from
color: A nonnegative integer indicating how to group the processes in the new communicators. Processes with the same color are in the same new communicator
key: An integer that controls rank assignment. Ranks are always assigned from 0 to the number of processes in the communicator. The key determines the relative ordering of the processes' ranks in the new communicator.
newcomm: The new communicator
int: The function returns an integer indicating whether it was successful or not.
Longer Explanations
Throughout this bear in mind that you have many processes executing what appears to be the same code. As such, the value of newcomm can differ between processes.
color is an integer value that determines in which of the new sub-communicators the current process will fall. All processes of comm for which color has the same numerical value will be part of the same new sub-communicator newcomm.
For example, if you defined color = rank%4 (see Example 4, below), then you would create (globally) four new communicators. Keep in mind that each process will only be seeing the one of these new communicators which it is part of. Put another way, color determines which the "teams" you will create, like the jersey color of football teams.
key will determines how processes are ranked within the new communicators they are part of. If you set key = rank, then the order of ranking (not the ranking itself) in each new communicator newcomm will follow the order of ranking in the original communicator comm. If two or more values of key have the same value, then the process that had the lower rank in comm has the lower rank in newcomm. (See Example 2, below.)
Examples
Here are a few pictorial examples which I duplicate in code below. Example #5 answers your specific question.
Code for the Examples
//Compile with mpic++ main.cpp
#include <mpi.h>
int main(int argc, char **argv){
int world_size, world_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &world_size); //Get the number of processes
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); //Get the rank of the process
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Comm newcomm1, newcomm2, newcomm3, newcomm4, newcomm5;
//Example 1: Duplicate the existing communicator. The command `MPI_Comm_dup()`
// does exactly this.
MPI_Comm_split(comm, 0, world_rank, &newcomm1);
//Example 2: Duplicate the existing communicator, but reverse the
// rankings
MPI_Comm_split(comm, 0, world_size-world_rank, &newcomm2);
int rank2; //Get the rank of the process
MPI_Comm_rank(newcomm2, &rank2); //in the new communicator
//Example 3: Split each process into its own communicator. This is the
// equivalent of using `MPI_COMM_SELF` for each process.
MPI_Comm_split(comm, world_rank, world_rank, &newcomm3);
//Example 4: Split processes into communicators based on their colouring. Use
// their rank in the existing communicator to determine their
// relative rank order in the new communicator.
int color = world_rank / 4;
MPI_Comm_split(comm, color, world_rank, &newcomm4);
int rank4; //Get the rank of the process
MPI_Comm_rank(newcomm2, &rank4); //in the new communicator
//Example 5: Group only some of the processes into a new communicator based on
//a flag.
int flag = world_rank%2==0; //An example flag
MPI_Comm_split(comm, flag?0:MPI_UNDEFINED, world_rank, &newcomm5);
MPI_Finalize();
}
More Information
This page has a nice tutorial on communicators and groups.
The SVG file I developed for the examples is available here

Related

Static Analysis erroneously reports out of bounds access

While reviewing a codebase, I came upon a particular piece of code that triggered a warning regarding an "out of bounds access". After looking at the code, I could not see a way for the reported access to happen - and tried to minimize the code to create a reproducible example. I then checked this example with two commercial static analysers that I have access to - and also with the open-source Frama-C.
All 3 of them see the same "out of bounds" access.
I don't. Let's have a look:
3 extern int checker(int id);
4 extern int checker2(int id);
5
6 int compute(int *q)
7 {
8 int res = 0, status;
9
10 status = checker2(12);
11 if (!status) {
12 status = 1;
13 *q = 2;
14 for(int i=0; i<2 && 0!=status; i++) {
15 if (checker(i)) {
16 res = i;
17 status=checker2(i);
18 }
19 }
20 }
21 if (!status)
22 *q = res;
23 return status;
24 }
25
26 int someFunc(int id)
27 {
28 int p;
29 extern int data[2];
30
31 int status = checker2(132);
32 status |= compute(&p);
33 if (status == 0) {
34 return data[p];
35 } else
36 return -1;
37 }
Please don't try to judge the quality of the code, or why it does things the way it does. This is a hacked, cropped and mutated version of the original, with the sole intent being to reach a small example that demonstrates the issue.
All analysers I have access to report the same thing - that the indexing in the caller at line 34, doing the return data[p] may read via the invalid index "2". Here's the output from Frama-C - but note that two commercial static analysers provide exactly the same assessment:
$ frama-c -val -main someFunc -rte why.c |& grep warning
...
why.c:34:[value] warning: accessing out of bounds index. assert p < 2;
Let's step the code in reverse, to see how this out of bounds access at line 34 can happen:
To end up in line 34, the returned status from both calls to checker2 and compute should be 0.
For compute to return 0 (at line 32 in the caller, line 23 in the callee), it means that we have performed the assignment at line 22 - since it is guarded at line 21 with a check for status being 0. So we wrote in the passed-in pointer q, whatever was stored in variable res. This pointer points to the variable used to perform the indexing - the supposed out-of-bounds index.
So, to experience an out of bounds access into the data, which is dimensioned to contain exactly two elements, we must have written a value that is neither 0 nor 1 into res.
We write into res via the for loop at 14; which will conditionally assign into res; if it does assign, the value it will write will be one of the two valid indexes 0 or 1 - because those are the values that the for loop allows to go through (it is bound with i<2).
Due to the initialization of status at line 12, if we do reach line 12, we will for sure enter the loop at least once. And if we do write into res, we will write a nice valid index.
What if we don't write into it, though? The "default" setup at line 13 has written a "2" into our target - which is probably what scares the analysers. Can that "2" indeed escape out into the caller?
Well, it doesn't seem so... if the status checks - at either line 11 or at line 21 fail, we will return with a non-zero status; so whatever value we wrote (or didn't, and left uninitialised) into the passed-in q is irrelevant; the caller will not read that value, due to the check at line 33.
So either I am missing something and there is indeed a scenario that leads to an out of bounds access with index 2 at line 34 (how?) or this is an example of the limits of mainstream formal verification.
Help?
When dealing with a case such as having to distinguish between == 0 and != 0 inside a range, such as [INT_MIN; INT_MAX], you need to tell Frama-C/Eva to split the cases.
By adding //# split annotations in the appropriate spots, you can tell Frama-C/Eva to maintain separate states, thus preventing merging them before status is evaluated.
Here's how your code would look like, in this case (courtesy of #Virgile):
extern int checker(int id);
extern int checker2(int id);
int compute(int *q)
{
int res = 0, status;
status = checker2(12);
//# split status <= 0;
//# split status == 0;
if (!status) {
status = 1;
*q = 2;
for(int i=0; i<2 && 0!=status; i++) {
if (checker(i)) {
res = i;
status=checker2(i);
}
}
}
//# split status <= 0;
//# split status == 0;
if (!status)
*q = res;
return status;
}
int someFunc(int id)
{
int p;
extern int data[2];
int status = checker2(132);
//# split status <= 0;
//# split status == 0;
status |= compute(&p);
if (status == 0) {
return data[p];
} else
return -1;
}
In each case, the first split annotation tells Eva to consider the cases status <= 0 and status > 0 separately; this allows "breaking" the interval [INT_MIN, INT_MAX] into [INT_MIN, 0] and [1, INT_MAX]; the second annotation allows separating [INT_MIN, 0] into [INT_MIN, -1] and [0, 0]. When these 3 states are propagated separately, Eva is able to precisely distinguish between the different situations in the code and avoid the spurious alarm.
You also need to allow Frama-C/Eva some margin for keeping the states separated (by default, Eva will optimize for efficiency, merging states somewhat aggressively); this is done by adding -eva-precision 1 (higher values may be required for your original scenario).
Related options: -eva-domains sign (previously -eva-sign-domain) and -eva-partition-history N
Frama-C/Eva also has other options which are related to splitting states; one of them is the signs domain, which computes information about sign of variables, and is useful to distinguish between 0 and non-zero values. In some cases (such as a slightly simplified version of your code, where status |= compute(&p); is replaced with status = compute(&p);), the sign domain may help splitting without the need for annotations. Enable it using -eva-domains sign (-eva-sign-domain for Frama-C <= 20).
Another related option is -eva-partition history N, which tells Frama-C to keep the states partitioned for longer.
Note that keeping states separated is a bit costly in terms of analysis, so it may not scale when applied to the "real" code, if it contains several more branches. Increasing the values given to -eva-precision and -eva-partition-history may help, as well as adding # split annotations.
I'd like to add some remarks which will hopefully be useful in the future:
Using Frama-C/Eva effectively
Frama-C contains several plug-ins and analyses. Here in particular, you are using the Eva plug-in. It performs an analysis based on abstract interpretation that reports all possible runtime errors (undefined behaviors, as the C standard puts it) in a program. Using -rte is thus unnecessary, and adds noise to the result. If Eva cannot be certain about the absence of some alarm, it will report it.
Replace the -val option with -eva. It's the same thing, but the former is deprecated.
If you want to improve precision (to remove false alarms), add -eva-precision N, where 0 <= N <= 11. In your example program, it doesn't change much, but in complex programs with multiple callstacks, extra precision will take longer but minimize the number of false alarms.
Also, consider providing a minimal specification for the external functions, to avoid warnings; here they contain no pointers, but if they did, you'd need to provide an assigns clause to explicitly tell Frama-C whether the functions modify such pointers (or any global variables, for instance).
Using the GUI and Studia
With the Frama-C graphical interface and the Studia plug-in (accessible by right-clicking an expression of interest and choosing the popup menu Studia -> Writes), and using the Values panel in the GUI, you can easily track what the analysis inferred, and better understand where the alarms and values come from. The only downside is that, it does not report exactly where merges happen. For the most precise results possible, you may need to add calls to an Eva built-in, Frama_C_show_each(exp), and put it inside a loop to get Eva to display, at each iteration of its analysis, the values contained in exp.
See section 9.3 (Displaying intermediate results) of the Eva user manual for more details, including similar built-ins (such as Frama_C_domain_show_each and Frama_C_dump_each, which show information about abstract domains). You may need to #include "__fc_builtin.h" in your program. You can use #ifdef __FRAMAC__ to allow the original code to compile when including this Frama-C-specific file.
Being nitpicky about the term erroneous reports
Frama-C is a semantic-based tool whose main analyses are exhaustive, but may contain false positives: Frama-C may report alarms when they do not happen, but it should never forget any possible alarm. It's a trade-off, you can't have an exact tool in all cases (though, in this example, with sufficient -eva-precision, Frama-C is exact, as in reporting only issues which may actually happen).
In this sense, erroneous would mean that Frama-C "forgot" to indicate some issue, and we'd be really concerned about it. Indicating an alarm where it may not happen is still problematic for the user (and we work to improve it, so such situations should happen less often), but not a bug in Frama-C, and so we prefer using the term imprecisely, e.g. "Frama-C/Eva imprecisely reports an out of bounds access".

MPI: Best way to coordinate many sends and recieves

I've been away from parallel programming for a long period of time and I am trying to figure out the best method for coordinating sending large amounts of data between many processors with a complicated dependency structure. For example, I might to send data to/from the following processes:
int process_1_dependencies[] = {2,3,5,6}
int process_2_dependencies[] = {1}
int process_3_dependencies[] = {1,4,5}
int process_4_dependencies[] = {3,5,6}
int process_5_dependencies[] = {1,3,4,6}
int process_6_dependencies[] = {1,4,5,7}
int process_7_dependencies[] = {6,8}
int process_8_dependencies[] = {7}
The obvious, and stupid, way of doing this would be do something like:
for(int i = 0; i < world_size; i++)
{
for(int j = 0; j < dependency_length; j++)
{
if (i == my_rank)
{
mpi_irecv(...,source=dependency[j],)
}
else
{
if (i == dependency[j])
{
mpi_isend(...,dest=dependency[j])
}
}
}
// blocking stuff?
}
I'm not actually sure if this would work once you have 100's of communications going and in anycase, it seems super inefficient. It's at least O(N) and only allows a single process to be receiving at once. A better way would be to use blocking and ensure that independent processes are simultaneously exchanging information. But that becomes quite complicated and requires optimizing which processes are simultaneously sending and receiving.
Am I just completely overthinking this? Is it safe to do something like this (provided that every sending process has a receiving pair):
for(int i = 0; i < dependency_length; i++)
{
mpi_isend(..., dest=dependency[i], ...)
mpi_irecv(..., source=dependency[i], ...)
}
//blocking stuff
sorry for the lack of focus in the question. I'm away from my computer so I can't really test it out, and in even if it did would I guess I'm not confident that it is saleable and that the buffers would keep working for arbitrary numbers of processes?
To avoid queueing a large number of messages and to avoid opaque deadlock problems, you can also employ a single call to MPI_Alltoallv, where all sends and receives are done for you automatically, and---with crossed fingers--- even hope that you MPI implemetation is able to optimize all communication on its own. The prototype is
MPI_Alltoallv
(
sendbuf, // buffer containing all data needed by other ranks in comm
sendcounts, // number of elements to send to each rank in comm
sdispls, // offsets in sendbuf per rank in comm
sendtype, // MPI datatype of the sent data
recvbuf, // buffer to contain all data needed by this rank
recvcounts, // number of elements to receive per rank in comm
rdispls, // offsets in recvbuf per rank in comm
recvtype, // MPI datatype of the received data
comm // the communicator
);
where sendcounts would be directly related to your process_X_dependencies; it would contain non-zero values at positions listed by process_X_dependencies.

Generate “hash” functions programmatically

I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170 odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input & output has its own set of distinct value sets e.g. [ red, green, yellow ] or [ yes, no ] etc.
I’m putting together the entire state table using the existing code, and instead of puzzling through them by hand, I was wondering if there was an algorithmic way of determining an appropriate function to get to each result from the 10 inputs. Note, not all input columns may be required to determine an individual output column, i.e. r124 might only be dependent on i5, i6 and i9.
These are not continuous functions, and I expect I might end up with some sort of hashing function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there was some Karnaugh map like approach for multiple value non-binary functions ;-) )
If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.
First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of ways to get i as an output. Let p[i] = float(x[i])/float(n[i]) and then the entropy is - sum(p[i] * log(p[i]) for i in outputs). (Note, since p[i] < 1 the log(p[i]) is a negative number, and therefore the entropy is positive. Also note, if p[i] = 0 then we assume that p[i] * log(p[i]) is also zero.)
The amount of entropy can be thought of as the amount of information needed to predict the outcome.
Now here is the key question. What variable gives us the most information about the output per information about the input?
If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the entire set of outputs for that value of v. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v.
Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.
Then for each value, you repeat this process to get cascading nested if conditions.
This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
Now this assumed that you had a comprehensive enumeration. But what if you don't?
The answer to that is that the analysis that I described can be done for a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, then you'll come up with fairly good entropies for your first level. Repeat with 10,000 each of your branches on your second level, and the same will happen. Continue as long as it is computationally feasible.
If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this that and the other, here is the output you always get." If there is a reasonably short set of nested ifs that give the right output, you're probably going to find it. After that, you have the question of deciding whether to actually verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.
Tricky approach for the validation. If you can find fuzzing software written for your language, run the fuzzing software with the goal of trying to tease out every possible internal execution path for each bucket you find. If the fuzzing software decides that you can't get different answers than the one you think is best from the above approach, then you can probably trust it.
Algorithm is pretty straightforward. Given possible values for each input we can generate all the input vectors possible. Then per each output we can just eliminate these inputs that do no matter for the output. As the result we for each output we can get a matrix showing output values for all the input combinations excluding the inputs that do not matter for given output.
Sample input format (for code snipped below):
var schema = new ConvertionSchema()
{
InputPossibleValues = new object[][]
{
new object[] { 1, 2, 3, }, // input #0
new object[] { 'a', 'b', 'c' }, // input #1
new object[] { "foo", "bar" }, // input #2
},
Converters = new System.Func<object[], object>[]
{
input => input[0], // output #0
input => (int)input[0] + (int)(char)input[1], // output #1
input => (string)input[2] == "foo" ? 1 : 42, // output #2
input => input[2].ToString() + input[1].ToString(), // output #3
input => (int)input[0] % 2, // output #4
}
};
Sample output:
Leaving the heart of the backward conversion below. Full code in a form of Linqpad snippet is there: http://share.linqpad.net/cknrte.linq.
public void Reverse(ConvertionSchema schema)
{
// generate all possible input vectors and record the resul for each case
// then for each output we could figure out which inputs matters
object[][] inputs = schema.GenerateInputVectors();
// reversal path
for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
{
List<int> inputsThatDoNotMatter = new List<int>();
for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
{
// find all groups for input vectors where all other inputs (excluding current) are the same
// if across these groups outputs are exactly the same, then it means that current input
// does not matter for given output
bool inputMatters = inputs.GroupBy(input => ExcudeByIndexes(input, new[] { inputIdx }), input => schema.Convert(input)[outputIdx], ObjectsByValuesComparer.Instance)
.Where(x => x.Distinct().Count() > 1)
.Any();
if (!inputMatters)
{
inputsThatDoNotMatter.Add(inputIdx);
Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
}
}
// mapping table (only inputs that matters)
var mapping = new List<dynamic>();
foreach (var inputGroup in inputs.GroupBy(input => ExcudeByIndexes(input, inputsThatDoNotMatter), ObjectsByValuesComparer.Instance))
{
dynamic record = new ExpandoObject();
object[] sampleInput = inputGroup.First();
object output = schema.Convert(sampleInput)[outputIdx];
for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
{
if (inputsThatDoNotMatter.Contains(inputIdx))
continue;
AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);
}
AddProperty(record, $"Output #{outputIdx}", output);
mapping.Add(record);
}
// input x, ..., input y, output z form is needed
mapping.Dump();
}
}

Best way to maintain an RNG state in multiple devices in openCL

So I'm trying to make use of this custom RNG library for openCL:
http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
The library defines a state struct:
//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
And in order to generate a random uint, you pass in the state into the following function:
uint MWC64X_NextUint(mwc64x_state_t *s)
which updates the state, so that when you pass it into the function again, the next "random" number in the sequence will be generated.
For the project I am creating I need to be able to generate random numbers not just in different work groups/items but also across multiple devices simultaneously and I'm having trouble figuring out the best way to design this. Like should I create 1 mwc64x_state_t object per device/commandqueue and pass that state in as a global variable? Or is it possible to create 1 state object for all devices at once?
Or do I not even pass it in as a global variable and declare a new state locally within each kernel function?
The library also comes with this function:
void MWC64X_SeedStreams(mwc64x_state_t *s, ulong baseOffset, ulong perStreamOffset)
Which supposedly is supposed to split up the RNG into multiple "streams" but including this in my kernel makes it incredibly slow. For instance, if I do something very simple like the following:
__kernel void myKernel()
{
mwc64x_state_t rng;
MWC64X_SeedStreams(&rng, 0, 10000);
}
Then the kernel call becomes around 40x slower.
The library does come with some source code that serves as example usages but the example code is kind of limited and doesn't seem to be that helpful.
So if anyone is familiar with RNGs in openCL or if you've used this particular library before I'd very much appreciate your advice.
The MWC64X_SeedStreams function is indeed relatively slow, at least in comparison
to the MWC64X_NextUint call, but this is true of most parallel RNGs that try
to split a large global stream into many sub-streams that can be used in
parallel. The assumption is that you'll be calling NextUint many times
within the kernel (e.g. a hundred or more), but SeedStreams is only at the top.
This is an annotated version of the EstimatePi example that comes with
with the library (mwc64x/test/estimate_pi.cpp and mwc64x/test/test_mwc64x.cl).
__kernel void EstimatePi(ulong n, ulong baseOffset, __global ulong *acc)
{
// One RNG state per work-item
mwc64x_state_t rng;
// This calculates the number of samples that each work-item uses
ulong samplesPerStream=n/get_global_size(0);
// And then skip each work-item to their part of the stream, which
// will from stream offset:
// baseOffset+2*samplesPerStream*get_global_id(0)
// up to (but not including):
// baseOffset+2*samplesPerStream*(get_global_id(0)+1)
//
MWC64X_SeedStreams(&rng, baseOffset, 2*samplesPerStream);
// Now use the numbers
uint count=0;
for(uint i=0;i<samplesPerStream;i++){
ulong x=MWC64X_NextUint(&rng);
ulong y=MWC64X_NextUint(&rng);
ulong x2=x*x;
ulong y2=y*y;
if(x2+y2 >= x2)
count++;
}
acc[get_global_id(0)] = count;
}
So the intent is that n should be large and grow as the number
of work items grow, so that samplesPerStream remains around
a hundred or more.
If you want multiple kernels on multiple devices, then you
need to add another level of hierarchy to the stream splitting,
so for example if you have:
K : Number of devices (possibly on parallel machines)
W : Number work-items per device
C : Number of calls to NextUint per work-item
You end up with N=KWC total calls to NextUint across all
work-items. If your devices are identified as k=0..(K-1),
then within each kernel you would do:
MWC64X_SeedStreams(&rng, W*C*k, C);
Then the indices within the stream would be:
[0 .. N ) : Parts of stream used across all devices
[k*(W*C) .. (k+1)*(W*C) ) : Used within device k
[k*(W*C)+(i*C) .. (k*W*C)+(i+1)*C ) : Used by work-item i in device k.
It is fine if each kernel uses less than C samples, you can
over-estimate if necessary.
(I'm the author of the library).

C/OpenMP - issue with threadprivate and vectors of pointers

I'm new to the world of parallel programming and openmp, so this may be a futile question, but I can't really come up with good answer to what I'm experiencing, so I hope someone will be able to shed some light on the matter.
What I am trying to achieve is to have a private copy of a dinamically allocated matrix (of integers) for every thread that will handle the following parallel section, but as soon as the flow of execution enters said region the reference to the supposedly private matrix holds a null value.
Is there any limitation of this directive I'm not aware of? Everything seems to work just fine with monodimensional dynamic arrays.
A snippet of the code is the following one...
#define n 10000
int **matrix;
#pragma omp threadprivate(matrix)
int main()
{
matrix = (int**) calloc(n, sizeof(int*));
for(i=0;i<n;i++) matrix[i] = (int*) calloc(n, sizeof(int));
AdjacencyMatrix(n, matrix);
...
/* Explicitly turn off dynamic threads */
omp_set_dynamic(0);
#pragma omp parallel
{
// From now on, matrix is NULL...
executor_p(matrix, n);
}
....
Look at the OpenMP documentation regarding what happens with the threadprivate clause:
On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined, unless a COPYIN clause is specified in the PARALLEL directive
There's no guarantee of what value is going to be stored in the matrix variable in the parallel region.
OpenMP can privatise only variables with known storage size. That is you can have a private copy of an array if it was defined like double matrix[N][M]. In your case is not only the storage size unknown (a pointer doesn't store the number of elements that it is pointing to) but also your matrix is not a contiguous area in memory and rather a pointer to a list of dynamically allocated rows.
What you would end up with is having a private copy of the top-level pointer, not a private copy of the matrix data itself.

Resources