cython memoryview slices without GIL - parallel-processing

I want to release the GIL in order to parallelise a loop in Cython, where different slices of a memoryview are passed to some function inside the loop. The code looks like this:
cpdef void do_sth_in_parallel(bint[:,:] input, bint[:] output, int D):
    for d in prange(D, schedule='dynamic', nogil=True):
        output[d] = some_function_not_requiring_gil(x[d,:])
This is not possible as written, since selecting the slice x[d,:] seems to require the GIL. Running cython -a and using a normal for loop, I get the C code posted below. How can this be done in pure C?
__pyx_t_5.data = __pyx_v_x.data;
__pyx_t_5.memview = __pyx_v_x.memview;
__PYX_INC_MEMVIEW(&__pyx_t_5, 0);
{
    Py_ssize_t __pyx_tmp_idx = __pyx_v_d;
    Py_ssize_t __pyx_tmp_shape = __pyx_v_x.shape[0];
    Py_ssize_t __pyx_tmp_stride = __pyx_v_x.strides[0];
    if (0 && (__pyx_tmp_idx < 0))
        __pyx_tmp_idx += __pyx_tmp_shape;
    if (0 && (__pyx_tmp_idx < 0 || __pyx_tmp_idx >= __pyx_tmp_shape)) {
        PyErr_SetString(PyExc_IndexError, "Index out of bounds (axis 0)");
        __PYX_ERR(0, 130, __pyx_L1_error)
    }
    __pyx_t_5.data += __pyx_tmp_idx * __pyx_tmp_stride;
}
__pyx_t_5.shape[0] = __pyx_v_x.shape[1];
__pyx_t_5.strides[0] = __pyx_v_x.strides[1];
__pyx_t_5.suboffsets[0] = -1;
__pyx_t_6.data = __pyx_v_u.data;
__pyx_t_6.memview = __pyx_v_u.memview;
__PYX_INC_MEMVIEW(&__pyx_t_6, 0);
__pyx_t_6.shape[0] = __pyx_v_u.shape[0];
__pyx_t_6.strides[0] = __pyx_v_u.strides[0];
__pyx_t_6.suboffsets[0] = -1;

The following works for me:
from cython.parallel import prange

cdef bint some_function_not_requiring_gil(bint[:] x) nogil:
    return x[0]

cpdef void do_sth_in_parallel(bint[:,:] input, bint[:] output, int D):
    cdef int d
    for d in prange(D, schedule='dynamic', nogil=True):
        output[d] = some_function_not_requiring_gil(input[d,:])
The two main changes I had to make were changing x to input (because otherwise Cython assumes it can find x as a Python object at global scope), to fix the error
Converting to Python object not allowed without gil
and adding cdef int d to force the type of d and fix the error
Coercion from Python not allowed without the GIL
(I also created an example some_function_not_requiring_gil but I assume this is fairly obvious)

Solution that works for me:
Access the array slice using
input[d:d+1, :]
instead of
input[d, :]
and pass a 2D array.
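For completeness, here is a minimal, untested sketch of how that 2D-slice workaround might look; the helper mirrors the example above but now accepts a 2D memoryview (the names are illustrative):

from cython.parallel import prange

cdef bint some_function_not_requiring_gil(bint[:, :] x) nogil:
    # x is a slice of shape (1, width); read the first (and only) row
    return x[0, 0]

cpdef void do_sth_in_parallel(bint[:,:] input, bint[:] output, int D):
    cdef int d
    for d in prange(D, schedule='dynamic', nogil=True):
        # pass a 2D slice instead of a 1D row
        output[d] = some_function_not_requiring_gil(input[d:d+1, :])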

vim rand() is not deterministic. Is this expected?

According to :help rand(),
rand([{expr}])
Return a pseudo-random Number generated with an xoshiro128**
algorithm using seed {expr}. The returned number is 32 bits,
also on 64 bits systems, for consistency.
{expr} can be initialized by srand() and will be updated by
rand(). If {expr} is omitted, an internal seed value is used
and updated.
Examples:
:echo rand()
:let seed = srand()
:echo rand(seed)
:echo rand(seed) % 16 " random number 0 - 15
It doesn't explain how a seed is changed every time rand() is called, but I expected it to be deterministically altered because
C++'s std::rand() does so,
and Wikipedia says
A pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), is an algorithm...
However, in the code below, the value of a is deterministic but the values of b are not deterministic; they take different values when you restart the script.
let seed = srand(0)
let a = rand(seed) "deterministic
let b = rand() "not deterministic (why?)
echo [a, b]
let seed = [0, 1, 2, 3]
let a = rand(seed) "deterministic
let b = rand() "not deterministic (why?)
echo [a, b]
Is this an expected behavior? I think the behavior contradicts the documentation.
Environments:
~ $ vi --version
VIM - Vi IMproved 8.2 (2019 Dec 12, compiled Apr 30 2020 13:32:36)
Included patches: 1-664
The algorithm used in Vim is fully deterministic. What creates the confusion is the fact that calling rand(seed) updates the seed "in place" but does not update any internal value(s). Therefore any subsequent rand() uses another (more or less random; the quality depends on the platform) internal seed value. So if you want to produce a fully deterministic sequence, you must consistently invoke rand(seed) with the same variable.
This behaviour is easy to deduce from Vim's source code. Also :h rand() says that:
Return a pseudo-random Number generated with an xoshiro128**
algorithm using seed {expr}. The returned number is 32 bits,
also on 64 bits systems, for consistency.
{expr} can be initialized by srand() and will be updated by
rand(). If {expr} is omitted, an internal seed value is used
and updated.
If you find the wording misleading you can open an issue on github.
The documentation is badly written but the behavior is actually the expected one from the source code's perspective.
Analysis
rand() is defined as f_rand() in src/evalfunc.c. From the snippet at the end of this answer, we know some things:
f_rand() has only two sets of static variables: gx, ..., gw and initialized.
gx, ..., gw are the internal seeds. Their values are touched and referenced only when f_rand() is called with no argument (i.e. when argvars[0].v_type == VAR_UNKNOWN).
initialized remembers if f_rand() has ever been called with no argument and it is also touched and referenced only when f_rand() is called with no argument.
When f_rand() is called with a seed,
The value of the seed is used once and that is not saved as a static variable. In other words, the sentence "{expr} can be initialized by srand() and will be updated by rand()" in the documentation is nothing but a "lie"; {expr} is not remembered and thus not updated by the subsequent f_rand().
The value of the seed is updated in place via the pointers lx, ..., lw.
Conclusion
The sentence
{expr} can be initialized by srand() and will be updated by rand()
should be modified to something like
{expr} can be initialized by srand() and will be updated by rand({expr}). You may want to store a seed into a variable and pass it to rand() since {expr} is not remembered in the function.
If you need the deterministic rand(), do this:
let seed = srand(0)
let a = rand(seed) "The value of `seed` is changed in place.
let b = rand(seed) "ditto
echo [a, b]
The Source Code of rand()
#define ROTL(x, k) ((x << k) | (x >> (32 - k)))
#define SPLITMIX32(x, z) ( \
z = (x += 0x9e3779b9), \
z = (z ^ (z >> 16)) * 0x85ebca6b, \
z = (z ^ (z >> 13)) * 0xc2b2ae35, \
z ^ (z >> 16) \
)
#define SHUFFLE_XOSHIRO128STARSTAR(x, y, z, w) \
result = ROTL(y * 5, 7) * 9; \
t = y << 9; \
z ^= x; \
w ^= y; \
y ^= z, x ^= w; \
z ^= t; \
w = ROTL(w, 11);
/*
 * "rand()" function
 */
    static void
f_rand(typval_T *argvars, typval_T *rettv)
{
    list_T          *l = NULL;
    static UINT32_T gx, gy, gz, gw;
    static int      initialized = FALSE;
    listitem_T      *lx, *ly, *lz, *lw;
    UINT32_T        x, y, z, w, t, result;

    if (argvars[0].v_type == VAR_UNKNOWN)
    {
        // When no argument is given use the global seed list.
        if (initialized == FALSE)
        {
            // Initialize the global seed list.
            init_srand(&x);

            gx = SPLITMIX32(x, z);
            gy = SPLITMIX32(x, z);
            gz = SPLITMIX32(x, z);
            gw = SPLITMIX32(x, z);
            initialized = TRUE;
        }
        SHUFFLE_XOSHIRO128STARSTAR(gx, gy, gz, gw);
    }
    else if (argvars[0].v_type == VAR_LIST)
    {
        l = argvars[0].vval.v_list;
        if (l == NULL || list_len(l) != 4)
            goto theend;

        lx = list_find(l, 0L);
        ly = list_find(l, 1L);
        lz = list_find(l, 2L);
        lw = list_find(l, 3L);
        if (lx->li_tv.v_type != VAR_NUMBER) goto theend;
        if (ly->li_tv.v_type != VAR_NUMBER) goto theend;
        if (lz->li_tv.v_type != VAR_NUMBER) goto theend;
        if (lw->li_tv.v_type != VAR_NUMBER) goto theend;
        x = (UINT32_T)lx->li_tv.vval.v_number;
        y = (UINT32_T)ly->li_tv.vval.v_number;
        z = (UINT32_T)lz->li_tv.vval.v_number;
        w = (UINT32_T)lw->li_tv.vval.v_number;
        SHUFFLE_XOSHIRO128STARSTAR(x, y, z, w);
        lx->li_tv.vval.v_number = (varnumber_T)x;
        ly->li_tv.vval.v_number = (varnumber_T)y;
        lz->li_tv.vval.v_number = (varnumber_T)z;
        lw->li_tv.vval.v_number = (varnumber_T)w;
    }
    else
        goto theend;

    rettv->v_type = VAR_NUMBER;
    rettv->vval.v_number = (varnumber_T)result;
    return;

theend:
    semsg(_(e_invarg2), tv_get_string(&argvars[0]));
    rettv->v_type = VAR_NUMBER;
    rettv->vval.v_number = -1;
}

GLSL optimization: check if variable is within range

In my shader I have a variable b and need to determine within which range it lies and, based on that, assign the right value to a variable a. I ended up with a lot of if statements:
float a = const1;
if (b >= 2.0 && b < 4.0) {
    a = const2;
} else if (b >= 4.0 && b < 6.0) {
    a = const3;
} else if (b >= 6.0 && b < 8.0) {
    a = const4;
} else if (b >= 8.0) {
    a = const5;
}
My question is: could this lead to performance issues (branching), and how can I optimize it? I've looked at the step and smoothstep functions but haven't figured out a good way to accomplish this.
To solve the problem depicted and avoid branching, the usual technique is to find a series of math functions, one for each condition, that evaluate to 0 for all the conditions except the one the variable satisfies. We can use these functions as gains to build a sum that evaluates to the right value each time.
In this case the conditions are simple intervals, so using the step functions we could write:
x in [a,b] as step(a,x)*step(x,b) (notice the inversion of x and b to get x<=b)
Or
x in [a,b[ as step(a,x)-step(b,x) as explained in this other post: GLSL point inside box test
Using this technique we obtain:
float a = (step(x,2.0)-(step(2.0,x)*step(x,2.0)))*const1 +
          (step(2.0,x)-step(4.0,x))*const2 +
          (step(4.0,x)-step(6.0,x))*const3 +
          (step(6.0,x)-step(8.0,x))*const4 +
          step(8.0,x)*const5;
This works for general disjoint intervals, but in the case of a step or staircase function as in this question, we can simplify it as:
float a = const1 + step(2.0,x)*(const2-const1) +
          step(4.0,x)*(const3-const2) +
          step(6.0,x)*(const4-const3) +
          step(8.0,x)*(const5-const4);
We could also use a 'bool conversion to float' as a means to express our conditions; for example, step(8.0,x)*(const5-const4) is equivalent to float(x>=8.0)*(const5-const4).
You can avoid branching by creating kind of a lookup table:
float table[5] = {const1, const2, const3, const4, const5};
float a = table[int(clamp(b, 0.0, 8.0) / 2)];
But the performance will depend on whether the lookup table will have to be created in every shader or if it's some kind of uniform... As always, measure first...
It turned out Jaa-c's answer wasn't viable for me, as I'm targeting WebGL, which doesn't allow variables as array indices (unless it's a loop index). His solution might work great for other OpenGL implementations though.
I came up with this solution using mix and step functions:
//Outside of main function:
uniform vec3 constArray[5]; // Values are sent in to shader
//Inside main function:
float a = constArray[0];
a = mix(a, constArray[1], step(2.0, b));
a = mix(a, constArray[2], step(4.0, b));
a = mix(a, constArray[3], step(6.0, b));
a = mix(a, constArray[4], step(8.0, b));
But after some testing it didn't give any visible performance boost. I finally ended up with this solution:
float a = constArray[0];
if (b >= 2.0)
    a = constArray[1];
if (b >= 4.0)
    a = constArray[2];
if (b >= 6.0)
    a = constArray[3];
if (b >= 8.0)
    a = constArray[4];
Which is both compact and easily readable. In my case both these alternatives and my original code performed equally, but at least here are some options to try out.

Mata error 3204

I am unsure why I am getting an error.
I think it may stem from a misunderstanding around the structure syntax, but I am not certain if this is the issue (it would be unsurprising if there are multiple issues).
I am emulating code (from William Gould's The Mata Book) in which the input is a scalar, but the input for the program I am writing is a colvector.
The objective of this exercise is to create a square matrix from a column vector (according to some rules) and once created, multiply this square matrix by itself.
The code is the following:
*! spatial_lag version 1.0.0
version 15
set matastrict on
//--------------------------------------------------------------
local SL struct laginfo
local RS real scalar
local RC real colvector
local RM real matrix
//--------------------------------------------------------------
mata
`SL'
{
    //-------------------inputs:
    `RC' v
    //-------------------derived:
    `RM' W
    `RM' W2
    `RS' n
}
void lagset(`RC' v)
{
    `SL' scalar r
    // Input:
    r.v = v
    // I set the derived variables to missing:
    r.W = .z
    r.W2 = .z
    r.n = .z    // length of vector V
}
`RM' w_mat(`SL' scalar r)
{
    if (r.W == .z) {
        real scalar row, i
        real scalar col, j
        r.W = J(r.n,r.n,0)
        for (i=1; i<=r.n; i++) {
            for (j=1; j<=r.n; j++) {
                if (j!=i) {
                    if (r.v[j]==r.v[i]) {
                        r.W[i,j] = 1
                    }
                }
            }
        }
    }
    return(r.W)
}
`RS' wlength(`SL' scalar r)
{
    if (r.n == .z) {
        r.n = length(r.v)
    }
    return(r.n)
}
`RM' w2mat(`SL' scalar r)
{
    if (r.W2 == .z) {
        r.W2 = r.W * r.W
    }
    return(r.W2)
}
end
This compiles without a problem, but it gives an error when I attempt to use it interactively as follows:
y=(1\1\1\2\2\2)
q = lagset(y)
w_mat(q)
w2mat(q)
The first two lines run fine, but when I run the last two of those lines, I get:
w_mat(): 3204 q[0,0] found where scalar required
<istmt>: - function returned error
What am I misunderstanding?
This particular error is unrelated to structures. Stata simply complains because the lagset() function is void. That is, it does not return anything. Thus, q ends up being empty, which is in turn used as input in the function w_mat() inappropriately - hence the q[0,0] reference.
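For illustration, here is a minimal, untested sketch of one way to address that, reusing the local macros from the listing above: make lagset() build and return the struct, so that q actually holds a struct laginfo scalar that w_mat() and the other functions can accept.
`SL' scalar lagset(`RC' v)
{
    `SL' scalar r
    // Input:
    r.v = v
    // Derived values start out as missing:
    r.W  = .z
    r.W2 = .z
    r.n  = .z
    return(r)
}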

rfactor schedule for descriptor matching

I'm trying to use Halide for brute-force descriptor (e.g. SIFT) matching. I'd like to try rfactor in the schedule, but I can't seem to get the associativity prover to oblige. So far I have the following:
Var c("c"), i("i");
Func diff("diff"), diffSq("diffSq"), dotp("dotp"), out("out"),
     inp1("inp1"), inp2("inp2"), minVal("minVal");
inp1(c,x) = input1(c,x);
inp2(c,y) = input2(c,y);
diff(x,y,c) = inp1(c, x) - inp2(c, y);
diffSq(x,y,c) = diff(x,y,c) * diff(x,y,c);
RDom rc(0,128);
dotp(x, y) = 0.f;
dotp(x, y) += diffSq(x, y, rc);
// Argmin, see https://github.com/halide/Halide/blob/master/test/correctness/rfactor.cpp#L804
RDom ry(0, input2.height(), "ry");
minVal(x) = {-1, std::numeric_limits<float>::max()};
minVal(x) = {
    select(minVal(x)[1] < dotp(x, ry),
           minVal(x)[0],
           ry),
    min(minVal(x)[1], dotp(x, ry))
};
out(x) = minVal(x)[0];
// Schedule
RVar ryo("ryo"), ryi("ryi");
Var yy("yy");
Func intermediate("inter");
dotp.compute_root();
minVal.update(0).split(ry, ryo, ryi, 16);
//intermediate = minVal.update(0).rfactor(ryo, yy);
The last line, when uncommented, sadly fails with:
Failed to call rfactor() on minVal.update(0) since it can't prove associativity of the operator
Thanks for any pointers as to how I could resolve this!
Quick answer: only one order of the Tuple elements is matched. Flipping them should allow rfactor. There will be a more complete answer on the list and we'll look at generalizing the matcher. (Answering to make sure the SO side doesn't get forgotten.)
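For reference, here is an untested sketch of what that flipped ordering might look like, mirroring the argmin pattern from the linked rfactor test: the running minimum goes in tuple element 0 and the index in element 1.
// Running minimum first, index second.
minVal(x) = {std::numeric_limits<float>::max(), -1};
minVal(x) = {
    min(minVal(x)[0], dotp(x, ry)),
    select(minVal(x)[0] < dotp(x, ry), minVal(x)[1], ry)
};
out(x) = minVal(x)[1];

// With this ordering the split + rfactor schedule should be accepted:
minVal.update(0).split(ry, ryo, ryi, 16);
Func intermediate = minVal.update(0).rfactor(ryo, yy);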

Best Practices with Initialization or Pre-allocation - MATLAB

My question doesn't depend expressly on one snippet of code, but is more conceptual.
Unlike some programming languages, MATLAB doesn't require variables to be initialized expressly before they're used. For example, it is perfectly valid to define 'myVector' halfway through a script file like this:
myVector = vectorA .* vectorB
My question is: is it faster to initialize variables (such as 'myVector' above) to zero at the start and then assign values to them, or to keep defining them throughout the program as they are needed?
Here's a direct comparison of what I'm talking about:
Initializing throughout:
varA = 8;
varB = 2;
varC = varA - varB;
varD = varC * varB;
Initializing at start:
varA = 8;
varB = 2;
varC = 0;
varD = 0;
varC = varA - varB;
varD = varC * varB;
On one hand, it seems a bit of a waste to have these extra lines of code for no reason. On the other hand, though, it makes a little bit of sense that it would be faster to allocate all the memory for a program at once instead of spread out over the runtime.
Does anyone have a little insight?
Copy and paste your "Initializing at start" code into the MATLAB Editor window and you would get a Code Analyzer warning on those zero assignments (shown as a screenshot in the original post).
And if you go into its Details, you would read this -
Explanation
The code does not appear to use the assignment to the indicated variable. This situation occurs when any of the following are true:
Another assignment overwrites the value of the variable before an operation uses it.
The specified argument value contains a typographical error, causing it to appear unused.
The code does not use all values returned by a function call...
In our case, the applicable reason is the first one: another assignment overwrites the value of the variable before an operation uses it. So, this clarifies that initialization/pre-allocation won't help for that case.
When should we pre-allocate?
From my experience, pre-allocation helps when you later need to index into part of an array to store results.
Thus, if you need to index into a portion of varC to store the results, pre-allocation would help. Hence, this would make more sense -
varC = zeros(...)
varD = zeros(...)
varC(k,:) = varA - varB;
varD(k,:) = varC * varB;
Again, if while indexing you go beyond the current size of varC, MATLAB would spend time allocating more memory for it, which would slow things down a bit. So, pre-allocate output variables to the maximum size you think will be needed for storing results. But if you don't know the size of the results beforehand, you are in a bind: you have to keep appending results to the output variable(s), and that would slow things down for sure.
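As a rough illustration of that last point (a toy sketch with an arbitrary size N; measure on your own machine), compare growing an array inside a loop with pre-allocating it:
N = 1e5;

% Growing the array on every iteration: MATLAB may have to reallocate repeatedly.
tic;
grown = [];
for k = 1:N
    grown(k) = k^2; %#ok<SAGROW>
end
tGrow = toc;

% Pre-allocating to the final size: every assignment writes in place.
tic;
prealloc = zeros(1, N);
for k = 1:N
    prealloc(k) = k^2;
end
tPre = toc;

fprintf('grow: %.4f s, prealloc: %.4f s\n', tGrow, tPre);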
Alright! I've done some tests, and here are the results.
This is the code I used for the "throughout" variable assignments:
tic;
a = 1;
b = 2;
c = 3;
d = 4;
e = a - b;
f = e + c;
g = f - a;
h = g * c;
i = h - g;
j = 9 * i;
k = [j i h];
l = any(k);
b2(numel(b2) + 1) = toc
Here's the code for the "At Start" variable assignments:
tic;
a = 1;
b = 2;
c = 3;
d = 4;
e = 0;
f = 0;
g = 0;
h = 0;
i = 0;
j = 0;
k = 0;
l = 0;
e = a - b;
f = e + c;
g = f - a;
h = g * c;
i = h - g;
j = 9 * i;
k = [j i h];
l = any(k);
b1(numel(b1) + 1) = toc
I saved the time in the vectors 'b1' and 'b2'. Each was run with only MATLAB and Chrome open, and was the only script file open inside MATLAB. Each was run 201 times. Because the first time a program is run it compiles, I disregarded the first time value for both (I'm not interested in compile time).
To find the average, I used
mean(b1(2:201))
and
mean(b2(2:201))
The results:
"Throughout": 1.634311562062418e-05 seconds (0.000016343)
"At Start": 2.832598989758290e-05 seconds (0.000028326)
Interestingly (or perhaps not, who knows), defining variables only when needed, spread throughout the program, was almost twice as fast.
I don't know whether this is because of the way MATLAB allocates memory (maybe it just grabs a huge chunk and doesn't need to keep allocating more every time you define a variable?) or if the allocation speed is just so fast that it's eclipsed by the extra lines of code.
NOTE: As Divakar points out, mileage may vary when using arrays. My testing should hold true for when the size of variables doesn't change, however.
tl;dr: Setting variables to zero only to change them later is slow.
