Global shared variable in Halide

I'm trying to define a global shared variable that counts the non-zero elements in an input, like this:
counter = 0
counter += select(input[x,y] > 10, 1, 0)
However, this seems very hard to do in Halide. Is there any kind of global shared variable that serves this goal?

In general, Halide is designed to avoid mutable state, since it constrains the possible transformations and optimizations; mutable state is the enemy of portable performance. In this case, however, you can do exactly what you want by making counter a 0-dimensional Func and using an RDom to iterate over the elements of input:
Func counter;
RDom r(0, input.width(), 0, input.height());
counter() = 0;
counter() += select(input(r.x, r.y) > 10, 1, 0);
You can also write this in many other ways, using the sum helper, potentially using r.where for the predicate, and using the counter.rfactor scheduling directive to parallelize the reduction.
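For example, a minimal sketch of the sum-helper version (same semantics as the update definition above, assuming the same 2-D input):
Func counter;
RDom r(0, input.width(), 0, input.height());
// sum() builds the whole reduction into a single pure definition
counter() = sum(select(input(r.x, r.y) > 10, 1, 0));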

How does img.At(x, y) correlate with a uint32[][] structure [duplicate]

I am learning Go by going through A Tour of Go. One of the exercises there asks me to create a 2D slice of dy rows and dx columns containing uint8. My current approach, which works, is this:
a := make([][]uint8, dy) // initialize a slice of dy slices
for i := 0; i < dy; i++ {
    a[i] = make([]uint8, dx) // initialize a slice of dx uint8 in each of the dy slices
}
I think that iterating through each slice to initialize it is too verbose. And if the slice had more dimensions, the code would become unwieldy. Is there a concise way to initialize 2D (or n-dimensional) slices in Go?
There isn't a more concise way; what you did is the "right" way, because slices are always one-dimensional but may be composed to construct higher-dimensional objects. See this question for more details: Go: How is two dimensional array's memory representation.
One thing you can simplify is to use the for range construct:
a := make([][]uint8, dy)
for i := range a {
    a[i] = make([]uint8, dx)
}
Also note that if you initialize your slice with a composite literal, you get this for "free", for example:
a := [][]uint8{
    {0, 1, 2, 3},
    {4, 5, 6, 7},
}
fmt.Println(a) // Output is [[0 1 2 3] [4 5 6 7]]
Yes, this has its limits, as you seemingly have to enumerate all the elements. But there are some tricks: namely, you don't have to enumerate all values, only the ones that are not the zero value of the slice's element type. For more details about this, see Keyed items in golang array initialization.
For example, if you want a slice where the first 10 elements are zeros, followed by 1 and 2, it can be created like this:
b := []uint{10: 1, 2}
fmt.Println(b) // Prints [0 0 0 0 0 0 0 0 0 0 1 2]
Also note that if you use arrays instead of slices, they can be created very easily:
c := [5][5]uint8{}
fmt.Println(c)
Output is:
[[0 0 0 0 0] [0 0 0 0 0] [0 0 0 0 0] [0 0 0 0 0] [0 0 0 0 0]]
In case of arrays you don't have to iterate over the "outer" array and initialize "inner" arrays, as arrays are not descriptors but values. See blog post Arrays, slices (and strings): The mechanics of 'append' for more details.
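A quick sketch of that difference (my own illustration, not from the linked post):
arr := [3]int{1, 2, 3}
arrCopy := arr       // arrays are values: this copies all elements
arrCopy[0] = 99
fmt.Println(arr[0])  // 1: the original is untouched

sl := []int{1, 2, 3}
slAlias := sl        // slices are descriptors: this copies only the header
slAlias[0] = 99
fmt.Println(sl[0])   // 99: both descriptors share one backing array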
Try the examples on the Go Playground.
There are two ways to use slices to create a matrix. Let's take a look at the differences between them.
First method:
matrix := make([][]int, n)
for i := 0; i < n; i++ {
    matrix[i] = make([]int, m)
}
Second method:
matrix := make([][]int, n)
rows := make([]int, n*m)
for i := 0; i < n; i++ {
    matrix[i] = rows[i*m : (i+1)*m]
}
Regarding the first method, successive make calls don't guarantee that you end up with a contiguous matrix, so the matrix may be divided in memory. Consider an example with two goroutines that could cause this:
Goroutine #0 runs make([][]int, n) to get the memory allocated for matrix, getting a piece of memory from 0x000 to 0x07F.
Then, it starts the loop and does make([]int, m) for the first row, getting from 0x080 to 0x0FF.
In the second iteration it gets preempted by the scheduler.
The scheduler gives the processor to goroutine #1, which starts running. This one also uses make (for its own purposes) and gets from 0x100 to 0x17F (right next to the first row of goroutine #0).
After a while, it gets preempted and goroutine #0 starts running again.
It does the make([]int, m) corresponding to the second loop iteration and gets from 0x180 to 0x1FF for the second row. At this point, we already have two divided rows.
With the second method, the goroutine does make([]int, n*m) to get the whole matrix allocated in a single slice, ensuring contiguity. After that, a loop is needed to point each row of the matrix at its corresponding subslice.
You can play with the code shown above in the Go Playground to see the difference in the memory assigned by using both methods. Note that I used runtime.Gosched() only to yield the processor and force the scheduler to switch to another goroutine.
Which one to use? Imagine the worst case with the first method, i.e. each row is not next in memory to another row. Then, if your program iterates through the matrix elements (to read or write them), there will probably be more cache misses (hence higher latency) compared to the second method because of worse data locality. On the other hand, with the second method it may not be possible to get a single piece of memory allocated for the matrix, because of memory fragmentation (chunks spread all over the memory), even though theoretically there may be enough free memory for it.
Therefore, unless there's a lot of memory fragmentation and the matrix to be allocated is huge, you would want to use the second method to take advantage of data locality.
With Go 1.18 you get generics.
Here is a function that uses generics to create a 2D slice for any cell type.
func Make2D[T any](n, m int) [][]T {
    matrix := make([][]T, n)
    rows := make([]T, n*m)
    for i, startRow := 0, 0; i < n; i, startRow = i+1, startRow+m {
        endRow := startRow + m
        matrix[i] = rows[startRow:endRow:endRow]
    }
    return matrix
}
With that function in your toolbox, your code becomes:
a := Make2D[uint8](dy, dx)
You can play with the code on the Go Playground.
Here's a concise way to do it:
value := [][]string{{"A1", "A2"}, {"B1", "B2"}}
P.S.: you can change string to the element type you're using in your slice.

Halide - sort buffer/function in one dimension

I am currently using Halide with the use of a generator and ahead of time compilation.
Somewhere in the pipeline I have a 3D buffer with limited extent (typically 3-6 values) in one of the dimensions.
I would like to sort the values in that dimension.
When I skip the processing at the beginning of the pipeline, it looks somewhat like this:
Input<Buffer<uint16_t>>  input{"input", 2};    // Dimensions: (y, x)
Input<uint8_t>           sizeZ{"sizeZ"};       // Size in Z dimension
Output<Buffer<uint16_t>> output{"output", 3};  // Dimensions: (z, y, x)
Var x, y, z;
Func input3D;
input3D(z, y, x) = input(y, z + x * sizeZ);
output = 'sort input3D on Z dimension'         // pseudocode
I would be most helped if some sorting functionality were already available in Halide (is that so?).
An alternative would be to call an external C implementation to sort all values in that dimension and assign them to the output buffer.
That would be something like:
output(:, y, x) = external_sort(input(:, y, x))
Here I used Python slice notation to express all elements in the Z dimension.
Is something like this possible in Halide?
There is an example of calling an external C function to sort a Halide func in our tests here: https://github.com/halide/Halide/blob/master/test/correctness/extern_sort.cpp
General purpose sorting algorithms cannot be expressed in Halide. However, sorting networks for small vectors can be. See here for an example of bitonic sorting: https://github.com/halide/Halide/blob/master/test/correctness/sort_exprs.cpp
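For extents of 3-6, the sorting-network approach fits well. As a minimal sketch (my own illustration, not the linked test), a 3-element sorting network along z can be built from min/max compare-and-swap pairs, assuming the Var x, y, z and Func input3D from the question:
Expr a = input3D(0, y, x), b = input3D(1, y, x), c = input3D(2, y, x);
// compare-and-swap (a, b)
Expr lo1 = min(a, b), hi1 = max(a, b);
// compare-and-swap (hi1, c): isolates the maximum
Expr mid1 = min(hi1, c), hi = max(hi1, c);
// compare-and-swap (lo1, mid1): orders the remaining two
Expr lo = min(lo1, mid1), mid = max(lo1, mid1);
Func sorted;
sorted(z, y, x) = select(z == 0, lo, z == 1, mid, hi);
For larger (but still small) extents the same idea generalizes; the bitonic pattern in the linked test builds such networks programmatically.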

PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis of a multi-dimensional array. I stumbled upon this code, which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);
    b[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint s = gs/2; s > 0; s >>= 1) {
        if (lid < s) {
            b[lid] += b[lid+s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) r[wid] = b[lid];
}
I don't understand the for loop part. I get that uint s = gs/2 means that we split the array in half, but then it is a complete mystery. Without understanding it, I can't really implement another version for taking the maximum of an array for instance, even less for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be rerun another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanations on that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
It means that you divide the local size gs by 2, and keep dividing by 2 (the shift s >>= 1 is equivalent to s = s/2) while s > 0, in other words until s = 1. At each step, each of the first s work items adds the element s positions away into its own slot, halving the number of partial sums; after the last step, b[0] holds the sum for the whole work group. This algorithm depends on your array's size being a power of 2; otherwise you'd have to deal with the excess over a power of 2 until you have reduced the whole array, or pad your array with values neutral for the reduction up to a power-of-2 size.
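To reduce with a different operation, only the combine step changes; the tree-shaped loop stays identical. A minimal sketch of the max variant (my own adaptation, not from the linked post):
__kernel void reduce_max(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);
    b[lid] = a[gid];                // one element per work item
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint s = gs/2; s > 0; s >>= 1) {
        if (lid < s) {
            b[lid] = fmax(b[lid], b[lid + s]);  // keep the larger of each pair
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0) r[wid] = b[0];    // one partial maximum per work group
}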
Regarding your second concern, when N is bigger than the capacity of your GPU: you are right, you have to run the reduction in portions that fit, and then merge the results.
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you need to reduce along an axis.
If you think that the GPU would give you an advantage, first try pyopencl's Multidimensional Array functionality, e.g. max (see the sketch after this list).
If the reduction is more math intensive, try using pyopencl's Parallel Algorithms, e.g. reduction.
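A minimal sketch of that second suggestion (assuming pyopencl is installed; pyopencl.array ships ready-made reduction helpers):
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Move the data to the device and reduce it there.
host_data = np.random.rand(1024).astype(np.float32)
dev_data = cl_array.to_device(queue, host_data)
print(cl_array.max(dev_data).get())  # same value as host_data.max()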
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.

Halide: How to deal with Expr evaluated as nan or inf?

I have a 1D Func over which I'd like to perform the following: take the sum of a kernel of n values, and divide it by the sum of the kernel shifted by 1. Here's the code I have so far:
Var x("x");
Func result("result");
RDom r(0, kernel_size);
Expr sum1 = sum(vec_func(x+r));
Expr sum2 = sum(vec_func(x+r+1));
Expr quotient = sum1 / sum2;
result(x) = quotient;
This is an example of the type of calculation which might result in a NaN or Inf. Ideally I would be able to deal with this in Halide using something like this:
Expr safe_calc = select(isnan(quotient) || isinf(quotient), 0, quotient);
result(x) = safe_calc;
Does such a method exist in Halide?
Expr Halide::is_nan(Expr) exists right now, but we are missing is_finite. (Added as https://github.com/halide/Halide/issues/2497)
However: be aware that Halide does floating point math in accordance with -ffast-math rules, which means it is allowed to optimize the code in ways that assume NaN/Inf values can't happen. If it's possible to structure your code in a way to ensure such values aren't possible, you should do so.
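For example, assuming the only failure mode here is a zero denominator, a sketch of such a restructuring guards the divisor before dividing instead of testing the quotient afterwards:
Expr denom_ok = (sum2 != 0);
// Swap in a harmless denominator where it would be zero, do the divide,
// then discard that lane's result: the bad division never occurs.
Expr safe_quotient = select(denom_ok, sum1 / select(denom_ok, sum2, 1), 0);
result(x) = safe_quotient;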

Random Numbers with OpenCL using Random123

I have been looking at the Random123 library and this associated quote:
One mysterious man came to my booth and asked what I knew about generating random numbers with OpenCL. I told him about implementations of the Mersenne Twister, but he wasn't impressed. He told me about a new technical paper that explains how to generate random numbers on GPUs by combining integer counters and block ciphers. In reverential tones, he said that counter-based random number generators (CBRNGs) produce numbers with greater statistical randomness than the MT and with much greater speed.
I was able to get a demo running using this kernel:
__kernel void counthits(unsigned n, __global uint2 *hitsp) {
    unsigned tid = get_global_id(0);
    unsigned hits = 0, tries = 0;
    threefry4x32_key_t k = {{tid, 0xdecafbad, 0xfacebead, 0x12345678}};
    threefry4x32_ctr_t c = {{0, 0xf00dcafe, 0xdeadbeef, 0xbeeff00d}};
    while (tries < n) {
        union {
            threefry4x32_ctr_t c;
            int4 i;
        } u;
        c.v[0]++;
        u.c = threefry4x32(c, k);
        long x1 = u.i.x, y1 = u.i.y;
        long x2 = u.i.z, y2 = u.i.w;
        if ((x1*x1 + y1*y1) < (1L<<62)) {
            hits++;
        }
        tries++;
        if ((x2*x2 + y2*y2) < (1L<<62)) {
            hits++;
        }
        tries++;
    }
    hitsp[tid].x = hits;
    hitsp[tid].y = tries;
}
My questions now: won't this generate the same random numbers every time it's run, since the random numbers are based on the global id? How can I generate new random numbers each time? Is it possible to provide a seed as a parameter to the kernel and then use that somehow?
Is there anyone who has been using this lib and can give me some more insight into its use?
Yes. The example code generates the same sequences of random numbers every time it is called.
To get different streams of random numbers, just initialize any of the values k[1..3] and/or c[1..3] differently. You can initialize them from command line arguments, environment variables, time-of-day, saved state, /dev/urandom, or any other source. Just be aware that:
a) if you initialize all of them exactly the same way in two different runs, then those two runs will get the same stream of random numbers
b) if you initialize them differently in two different runs, then those two runs will get different streams of random numbers.
Sometimes you want property a). Sometimes you want property b). Take a moment to think about which you want and be sure that you're doing what you intend.
More generally, the functions in the library, e.g., threefry4x32, have no state. If you change any bit in the input (i.e., any bit in any of the elements of c or k), you'll get a completely different random, statistically independent, uniformly distributed output.
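For instance, a minimal sketch of the seed-as-kernel-parameter idea from the question (my illustration, not part of the library): the host passes a per-run seed, and mixing it into the key changes the entire stream:
__kernel void counthits(unsigned n, unsigned seed, __global uint2 *hitsp) {
    unsigned tid = get_global_id(0);
    // One key word varies per run (seed) and one per work item (tid);
    // any change to either yields an independent stream.
    threefry4x32_key_t k = {{tid, seed, 0xfacebead, 0x12345678}};
    /* ... rest of the kernel unchanged ... */
}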
P.S. I'm one of the authors of the library and of the paper "Parallel Random Numbers: As Easy as 1, 2, 3":
http://dl.acm.org/citation.cfm?id=2063405
If you're not a subscriber to the ACM digital library, the link above may hit a pay-wall. Alternatively, you can obtain the paper free of charge by following the link on this page:
http://www.thesalmons.org/john/random123/index.html
I can't help you with the library per se, but I can tell you that the most common way to generate random numbers in OpenCL is to save some state between calls to the kernel.
Random number generators usually use a state, from which a new state and a random number are generated. In practice, this isn't complicated at all: you just pass an extra array that holds state. In my codes, I implement random numbers as follows:
//Adapted from http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
uint rand_uint(uint2* rvec) {
    #define A 4294883355U
    uint x = rvec->x, c = rvec->y; //Unpack the state
    uint res = x ^ c;              //Calculate the result
    uint hi = mul_hi(x, A);        //Step the RNG
    x = x*A + c;
    c = hi + (x < c);
    *rvec = (uint2)(x, c);         //Pack the state back up
    return res;                    //Return the next result
    #undef A
}
inline float rand_float(uint2* rvec) {
    return (float)(rand_uint(rvec)) / (float)(0xFFFFFFFF);
}
__kernel void my_kernel(/*more arguments*/ __global uint2* randoms) {
    int index = get_global_id(0);
    uint2 rvec = randoms[index];
    //Call rand_uint or rand_float a number of times with "rvec" as argument.
    //These calls update "rvec" with new state, and return a random number.
    randoms[index] = rvec;
}
...then, all you do is pass in that extra array holding the RNG's state. In practice, you'll want to seed this array differently for each work item.
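A minimal sketch of such host-side seeding (my illustration; it assumes a plain C host and uses the cl_uint2 type from the standard OpenCL headers):
#include <CL/cl.h>
#include <stdlib.h>

/* Fill one uint2 state per work item before the first launch.
   Any scheme works as long as no two work items share a state. */
void seed_states(cl_uint2 *states, size_t n, unsigned seed) {
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        states[i].s[0] = (cl_uint)rand() ^ (cl_uint)(i * 2654435761u);
        states[i].s[1] = (cl_uint)rand() | 1u; /* keep the carry nonzero */
    }
}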
0xdecafbad, 0xfacebead, 0x12345678 and 0xf00dcafe, 0xdeadbeef, 0xbeeff00d are just arbitrarily chosen numbers; they're not special. Any other number (even 0) could be used in their place. I'll add a comment to the example code.
You can replace any of them with variables that you pass in; the only requirement for avoiding undesirable repetition in the output random "stream" is that you avoid repeating the (c, k) input tuple. The example code uses the thread id and loop index to ensure uniqueness, but you can easily add more variables, e.g. count the kernel invocations in the host code, pass that counter in, and use it in place of one of the elements of k or c.
By the way, despite the name 'counter-based random number generator', there's no requirement that the inputs (c, k) be 'counters'; it's just that counters happen to be the most convenient idiom for ensuring that inputs don't repeat.
