In summary, I'm looking for ways to deal with a situation where the very first step in the calculation is a conditional branch that selects between two computationally expensive paths.
I'm essentially trying to implement a graphics filter that operates on an image and a mask - the mask is a bitmap array the same size as the image, and the filter performs different operations according to the value of the mask. So I basically want to do something like this for each pixel:
if(mask == 1) {
foo();
} else {
bar();
}
where both foo and bar are fairly expensive operations. As I understand it, when I run this code on the GPU, it will have to calculate both branches for every pixel. (This gets even more expensive if there are more than two possible values for the mask.) Is there any way to avoid this?
One option I can think of would be to, in the host code, sort all the pixels into two 1-dimensional arrays based on the value of the mask at that point, and then run entirely different kernels on them; then reconstruct the image from the two datasets afterwards. The problem with this is that, in my case, I want to run the filter iteratively, and both the image and the mask change with each iteration (the mask is actually calculated from the image). If I'm splitting the image into two buckets in the host code, I have to transfer the image and mask from the GPU on each iteration, and then transfer the new buckets back to the GPU, introducing a new bottleneck to replace the old one.
Is there any other way to avoid this bottleneck?
Another approach might be to do a simple bucket sort within each work-group using the mask.
So add a local memory array and an atomic counter for each value of the mask. First, read a pixel (or a set of pixels might be better) for each work-item, increment the appropriate atomic counter, and write the pixel address into that location in the array.
Then perform a work-group barrier.
Then, as a final stage, assign some set of work-items, maybe a multiple of the underlying vector size, to each of those arrays and iterate through it. Your operations will then be largely efficient, barring some loss at the ends, and if you look at enough pixels per work-item you may lose very little efficiency even if you assign the entire group to one mask value and then the other in turn.
Given that your description only has two values of the mask, fitting two arrays into local memory should be pretty simple and scale well.
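In OpenCL C, the kernel skeleton might look something like this (a sketch only: kernel and argument names are invented, GROUP_SIZE is a build-time constant, and foo()/bar() stand in for your per-pixel operations):

__kernel void masked_filter(__global const float *image,
                            __global const uchar *mask,
                            __global float *out)
{
    // GROUP_SIZE: build-time constant (e.g. -D GROUP_SIZE=256)
    __local int fooIdx[GROUP_SIZE];   // indices of pixels with mask == 1
    __local int barIdx[GROUP_SIZE];   // indices of the remaining pixels
    __local int fooCount, barCount;

    int lid = get_local_id(0);        // assumes a 1D NDRange, one pixel each
    int gid = get_global_id(0);

    if (lid == 0) { fooCount = 0; barCount = 0; }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Stage 1: bucket each pixel's index by its mask value.
    if (mask[gid] == 1)
        fooIdx[atomic_inc(&fooCount)] = gid;
    else
        barIdx[atomic_inc(&barCount)] = gid;

    barrier(CLK_LOCAL_MEM_FENCE);

    // Stage 2: the whole group walks one bucket, then the other; each pass
    // is (nearly) divergence-free apart from the loop tails.
    for (int i = lid; i < fooCount; i += get_local_size(0))
        out[fooIdx[i]] = foo(image[fooIdx[i]]);   // foo/bar: your operations
    for (int i = lid; i < barCount; i += get_local_size(0))
        out[barIdx[i]] = bar(image[barIdx[i]]);
}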
Push a thread's demanding task to shared/local memory (synchronization slows the process) and execute the light ones until all light ones finish (so the slow sync latency is hidden by this), then execute the heavier ones.
if(mask == 1) {
uploadFoo();//heavy, upload to __local object[]
} else {
processBar(); // compute bar; when done, check local memory for a foo() to process, if any exists.
downloadFoo();
}
using a producer-consumer approach, maybe.
I have several processes, each of which calculates certain sub-matrices of one global matrix. The problem is that the sub-matrices will overlap, and in general they do not necessarily form a contiguous block within the global matrix. Each task might also have more than one sub-matrix.
Finally, in order to obtain my final matrix, I need to perform an element-wise summation of these sub-matrices, taking each one's position within the global matrix into account.
So far I am doing the following:
each processor has its own copy of the global array (matrix)
each processor then calculates a sub-matrix of that global matrix and adds the elements to the right position in the local copy of the global array
with mpi_allreduce I obtain the final global matrix, synchronized over all the tasks (this is the element-wise summation that gives my final result)
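In code, the reduction step is essentially this (a sketch using the C bindings; N and the fill-in step are placeholders):

// sketch: each rank owns a full local copy of the N x N global matrix
std::vector<double> local((size_t)N * N, 0.0);
// ... add this rank's sub-matrix elements into `local` at the right offsets ...
MPI_Allreduce(MPI_IN_PLACE, local.data(), N * N, MPI_DOUBLE, MPI_SUM,
              MPI_COMM_WORLD);
// every rank now holds the element-wise sum, i.e. the final global matrix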
This works reasonably well as long as my global matrix is small. However, it quickly becomes a memory bottleneck, as allocating a local copy of the global matrix on every process becomes more and more expensive.
One constraint is that I have to solve this with MPI only.
Another constraint is that I need to perform operations on that global matrix afterwards, where different tasks have (this time read-only) access to different parts of it. These blocks are not the same as the sub-matrix blocks from before.
I somehow stumbled upon MPI-3 shared memory arrays. However, I am not sure if this is the best solution for my problem, as several processes have to simultaneously add small, overlapping local arrays. For my operations afterwards, though, each process could also read from that global matrix again.
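From what I've read, a one-sided alternative might keep the global matrix in a single MPI window and let every rank add its sub-matrices with MPI_Accumulate(MPI_SUM), which is applied element-wise and atomically per element, so the overlaps would be safe. A sketch with made-up geometry (I am not sure this is the best fit):

#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1024;                  // global matrix is N x N (assumption)
    double *global = nullptr;
    MPI_Win win;
    // Rank 0 owns the storage; the other ranks attach a zero-size window.
    MPI_Win_allocate(rank == 0 ? (MPI_Aint)((size_t)N * N * sizeof(double)) : 0,
                     sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD,
                     &global, &win);
    if (rank == 0) std::fill(global, global + (size_t)N * N, 0.0);
    MPI_Win_fence(0, win);

    // Each rank computes one small sub-matrix (made-up 2 x 3 geometry here).
    int r0 = rank, c0 = 2 * rank, h = 2, w = 3;
    std::vector<double> sub((size_t)h * w, 1.0);

    // Add the sub-matrix row by row into the global matrix on rank 0.
    for (int i = 0; i < h; ++i) {
        MPI_Aint disp = (MPI_Aint)(r0 + i) * N + c0;
        MPI_Accumulate(&sub[(size_t)i * w], w, MPI_DOUBLE, 0, disp,
                       w, MPI_DOUBLE, MPI_SUM, win);
    }
    MPI_Win_fence(0, win);               // all accumulates are complete here
    // afterwards, ranks could read blocks back with MPI_Get in a new epoch

    MPI_Win_free(&win);
    MPI_Finalize();
}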
I am relatively inexperienced with solving these kinds of problems and I would be happy for any kind of suggestion.
Thanks!
I have to program an optimized multi-threaded implementation of the Levenshtein distance problem. It can be computed using dynamic programming with a matrix; the Wikipedia page on Levenshtein distance covers that well enough.
Now, I can compute diagonal elements concurrently. That is all alright.
My problem now comes with caches. Matrices in C++ are normally stored in memory row by row, correct? Well, that is not good for me, as I need two elements of the previous row and one element of the current row to compute my result; that is horrible cache-wise. The cache will hold the current row (or part of it); then I ask for the previous one, which it will probably not hold anymore.
Then for another one, I need a different part of the diagonal, so yet again, I ask for completely different rows and the cache will not have those ready for me.
Therefore, I would like to store my matrix in memory in blocks, or maybe diagonals. That will result in fewer cache misses and make my implementation faster again.
How do you do that? I tried searching the internet, but I could never find anything that would show me the way. Is it possible to tell C++ how to order that type in memory?
EDIT: Some of you seem confused about the nature of my question. I want to store a matrix (it does not matter whether I make it a 2D array or anything else) in a custom way in MEMORY. Normally, a 2D array is stored row after row; I need to work with diagonals, so caches will miss a lot on the huge matrices I will work with (possibly millions of rows and columns).
I believe you may have a misperception of how the (CPU) cache works.
It's true that CPU caching is linear - that is, if you access an address in memory, it will bring into the cache some previous and some successive memory locations - which is like "guessing" that subsequent accesses will involve 1-dimensional-close elements. However, this is true on the micro-level. A CPU's cache is made up of a large number of small "lines" (64 Bytes on all cache levels in recent Intel CPUs). The locality is limited to the line; different cache lines can come from completely different places in memory.
Thus, if you "need two elements of the previous row and one element of the current row" of your matrix, then the cache should work very well for you: Some of the cache will hold elements of the previous row, and some will hold elements of the current row. And when you advance to the next element, the cache overall will usually contain the matrix elements you need to access. Just make sure your order of iteration agrees with the order of progression within the cache line.
Also, in some cases you could be faced with a situation where different threads are thrashing the same cache lines due to the mapping from main memory into the cache. Without getting into details, that is something you need to think about (but again, has nothing to do with 2D vs 1D data).
Edit: As geza notes, if your matrix's rows are long, you will still be reading each memory location twice with the straightforward approach: once as a current-row value, then again as a previous-row value, since each value will be evicted from the cache before it's used as a previous-row value. If you want to avoid this, you can iterate over tiles of your matrix whose size (length x width x sizeof(element)) fits into the L1 cache (along with whatever else needs to be there). You can also consider storing your data in tiles, but I don't think that would be too useful.
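For illustration, a tiled loop might look like this (rows, cols, and process() are placeholders; TILE is a tuning assumption sized to your L1):

constexpr int TILE = 64;   // tuning assumption: a few tile rows should fit in L1
for (int bi = 0; bi < rows; bi += TILE)
    for (int bj = 0; bj < cols; bj += TILE)
        // row-major order over tiles respects the left/top DP dependencies
        for (int i = bi; i < std::min(bi + TILE, rows); ++i)
            for (int j = bj; j < std::min(bj + TILE, cols); ++j)
                process(i, j);   // placeholder for the per-cell update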
Preliminary comment: "Levenshtein distance" is edit distance (under the common definition). This is a very common problem; you probably don't even need to bother writing a solution yourself. Look for existing code.
Now, finally, for a proper answer... You don't actually need to have a matrix at all, and you certainly don't need to "save" it: it's enough to keep merely a "front" of your dynamic programming matrix rather than the whole thing.
But what "front" shall you choose, and how do you advance it? I suggest you use anti-diagonals as your front, and given each anti-diagonal, compute concurrently the next anti-diagonal. Thus it'll be {(0,0)}, then {(0,1),(1,0)}, then {(0,2),(1,1),(2,0)} and so on. Each anti-diagonal requires at most two earlier anti-diagonals - and if we keep the values of each anti-diagonal consecutively in memory, then the access pattern going up the next anti-diagonal is a linear progression along the previous anti-diagonals - which is great for the cache (see my other answer).
So, you'll "concurrentize" the computation give each thread a bunch of consecutive anti-diagonal elements to compute; that should do the trick. And at any time you will only keep 3 anti-diagonal in memory: the one you're working on and the two previous ones. You can cycle between three such buffers so you don't re-allocate memory all the time (but then make sure to pre-allocate buffers with the maximum anti-diagonal length).
This whole thing should work basically the same for the non-square case.
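A sequential sketch of this scheme (the inner loop over i is the part you would split between threads):

#include <algorithm>
#include <string>
#include <vector>

int editDistance(const std::string &a, const std::string &b)
{
    const int n = (int)a.size(), m = (int)b.size();
    std::vector<int> prev2, prev1 = {0}, cur;   // anti-diagonals d-2, d-1, d
    for (int d = 1; d <= n + m; ++d) {
        const int lo  = std::max(0, d - m), hi = std::min(d, n);
        const int lo1 = std::max(0, d - 1 - m);
        const int lo2 = std::max(0, d - 2 - m);
        cur.assign(hi - lo + 1, 0);
        for (int i = lo; i <= hi; ++i) {        // split this loop across threads
            const int j = d - i;
            if (i == 0 || j == 0) { cur[i - lo] = d; continue; }
            const int del = prev1[(i - 1) - lo1] + 1;                  // D(i-1, j)
            const int ins = prev1[i - lo1] + 1;                        // D(i, j-1)
            const int sub = prev2[(i - 1) - lo2] + (a[i-1] != b[j-1]); // D(i-1, j-1)
            cur[i - lo] = std::min({del, ins, sub});
        }
        prev2.swap(prev1);                      // rotate the three buffers
        prev1.swap(cur);
    }
    return prev1[0];   // anti-diagonal n+m holds the single value D(n, m)
}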
I'm not absolutely sure, but I think a matrix is stored as one long array, one row after the other, and is mapped to a matrix via pointer arithmetic, so you always refer to the same base address and calculate the offset in memory where your value is located.
Otherwise, you can easily implement it as your own type and give your matrix a two-index accessor (note that operator[] cannot take two arguments before C++23, but operator() can).
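For instance, a minimal sketch (row-major here; only the index formula would change for a blocked or diagonal layout):

#include <cstddef>
#include <vector>

struct Matrix {
    std::size_t rows, cols;
    std::vector<int> data;                    // one flat, contiguous allocation
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
    // row-major mapping; a blocked or diagonal layout only changes this formula
    int &operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};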
I am looking for an ideal data container with the following objectives:
The behavior of the container must be sort of like a queue, with the following specifications:
1) random access is not a must
2) iterating over the objects in both directions must be super fast (contiguous data would be better)
3) high-performing deletion from the front and insertion at the back is a must (a high number of deletes and appends are done at every time step)
4) items are not primitive types; they are objects.
I know doubly-linked lists are not high-performing containers.
Vectors (like std::vector in C++) are good, but they are not really optimized for deleting from the front; also, I don't think vectorization is possible at all given the size of the objects.
I was also looking at the possibility of a slot-map container, but I'm not sure it is the best option.
I was wondering if there are better options available?
You might be able to get away with just a regular vector and a start index that tells you where the "real" beginning of your data is.
to append to the back, use the regular method. This has an amortized constant-time complexity, which is probably fine for you given that you will be doing lots of pushing.
to delete from the front, increment start.
to access element i, use vector[start + i].
whenever you delete from anywhere but the front, or insert anywhere except the back, go ahead and recreate the whole vector without any leading deleted entries and reset start to zero.
Pros:
entries are in a contiguous chunk of memory
fast delete from the front and (amortized) fast insert into the back
fast random access and fast iteration
Cons:
slow worst-case insertion behavior
potentially lots of wasted space unless cleaned up periodically
cleaning up on deletes changes deletion's worst-case behavior to linear time, which is slow.
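A minimal sketch of this wrapper (T is whatever object type you store):

#include <cstddef>
#include <vector>

template <typename T>
class SlidingVector {
    std::vector<T> data;
    std::size_t start = 0;                     // index of the "real" front
public:
    void push_back(const T &v) { data.push_back(v); }   // amortized O(1)
    void pop_front() { ++start; }                       // O(1), leaves a hole
    T &operator[](std::size_t i) { return data[start + i]; }
    std::size_t size() const { return data.size() - start; }
    void compact() {                           // reclaim the leading dead space
        data.erase(data.begin(), data.begin() + (std::ptrdiff_t)start);
        start = 0;
    }
};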
Whatever you do, consider comparing to the natural approach: a doubly-linked list with the head and tail remembered.
fast inserts/deletes from the front/back
no wasted space
True, the items will not be contiguous in memory so there is a potential for more cache misses; however, you could combat this with occasional defragmentation:
allocate enough contiguous space for all nodes in the list
recreate nodes in order by traversing links
release the original nodes and use the new set of nodes as the list
Depending on the pattern of deletes/inserts/traversals, this could be feasible.
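A sketch of that defragmentation with a hand-rolled node type (names are made up; assumes the value type is copyable and default-constructible, and that count matches the list length):

#include <cstddef>

template <typename T>
struct Node { T value; Node *prev = nullptr, *next = nullptr; };

// Copy the nodes, in list order, into one contiguous block; the caller then
// frees the old nodes and adopts &block[0] as the new head.
template <typename T>
Node<T> *defragment(Node<T> *head, std::size_t count) {
    Node<T> *block = new Node<T>[count];
    std::size_t i = 0;
    for (Node<T> *p = head; p != nullptr; p = p->next, ++i) {
        block[i].value = p->value;
        block[i].prev = (i > 0) ? &block[i - 1] : nullptr;
        block[i].next = (i + 1 < count) ? &block[i + 1] : nullptr;
    }
    return block;
}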
If we really care about performance, the container should never allocate any memory dynamically, i.e. we should define an upper limit on the number of objects in the container.
The interface requirement is queue-ish indeed, so it looks like the fastest option would be a circular queue of pointers to objects. The container should have the following fields:
OBJECT * ptrs[SIZE] -- fixed-size array of pointers. Sure, we will waste SIZE * sizeof(OBJECT *) bytes here, but performance-wise it could be a good trade.
size_t head_idx -- head object index.
size_t tail_idx -- tail object index.
iterating over the objects in two directions must be super fast
Next object is a next index in the ptrs[]:
if (cur_idx >= tail_idx) return nullptr; // past the back (newest) element
return ptrs[(cur_idx++) % SIZE]; // make sure SIZE is a power of 2 constant
Prev object is a prev index in the ptrs[]:
if (cur_idx < head_idx) return nullptr; // past the front (oldest) element
return ptrs[(cur_idx--) % SIZE]; // make sure SIZE is a power of 2 constant
high performing delete from the front of the list and insert in the back is a must
The pop_front() would be as simple as:
if (tail_idx == head_idx) ... // should not happen; throw an error
head_idx++;
The push_back() would be as simple as:
if (tail_idx - head_idx >= SIZE) ... // should not happen; throw an error
ptrs[(tail_idx++) % SIZE] = obj_ptr; // make sure SIZE is a power of 2 constant
items are not primitive types, they are objects
The most generic solution would be to simply store pointers in the cyclic queue, so the size of the object does not matter and you waste just SIZE times the pointer size, not SIZE times the object size. But sure, if you can afford to preallocate thousands of objects, it should be even faster...
These are speculations based on your performance requirements. I am not sure whether you can afford to trade some memory for performance, so I am sorry if that is not the case...
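Pulling those fragments together into one compilable sketch (OBJECT is the stored type; SIZE must be a power-of-2 constant, as noted above; the monotonically increasing indices make the wrap-around arithmetic consistent):

#include <cstddef>
#include <stdexcept>

template <typename OBJECT, std::size_t SIZE>      // SIZE: power-of-2 capacity
class CircularQueue {
    OBJECT *ptrs[SIZE];                           // fixed-size, no allocation
    std::size_t head_idx = 0, tail_idx = 0;       // live range: [head, tail)
public:
    std::size_t size() const { return tail_idx - head_idx; }
    void push_back(OBJECT *p) {
        if (size() >= SIZE) throw std::length_error("queue full");
        ptrs[(tail_idx++) % SIZE] = p;
    }
    OBJECT *pop_front() {
        if (size() == 0) throw std::out_of_range("queue empty");
        return ptrs[(head_idx++) % SIZE];
    }
    // iteration in either direction: i runs over [0, size())
    OBJECT *at(std::size_t i) const { return ptrs[(head_idx + i) % SIZE]; }
};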
I have a list of integer axis-aligned cuboids that is being built and then processed (a dirty-region system).
Currently the list will often contain overlaps, with some coordinates getting processed many times as a result (although still far less work in total than the "process everything because of one change" approach). What I want is a simple way to prevent any such overlaps when adding a new region to the list.
Due to the size of the data (IIRC about 100 million cells), even though the coordinates are integers, I want to avoid a bool array with an entry for every coordinate to mark it up-to-date/dirty. On the other hand, the actual number of regions in the list will generally be pretty small (most of the time only covering a fraction of the data set, with individual regions being thousands of cells).
struct Region
{
int x, y, z;//corner coordinate
int w, h, d;//size
};
void addRegion(Region region)
{
regions.push_back(region);
}
So my current thinking in addRegion is to go through all the regions, find the overlapping ones, and split them up appropriately. However, even in 2D this seems tricky to come up with, so is there a known algorithm for this sort of thing?
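To make it concrete, the splitting I have in mind is subtracting one region from another, which leaves at most six disjoint boxes; a rough, untested sketch:

#include <algorithm>
#include <vector>

// Returns the parts of `a` not covered by `b`: at most six disjoint boxes.
std::vector<Region> subtract(const Region &a, const Region &b)
{
    int x0 = std::max(a.x, b.x), x1 = std::min(a.x + a.w, b.x + b.w);
    int y0 = std::max(a.y, b.y), y1 = std::min(a.y + a.h, b.y + b.h);
    int z0 = std::max(a.z, b.z), z1 = std::min(a.z + a.d, b.z + b.d);
    if (x0 >= x1 || y0 >= y1 || z0 >= z1)
        return {a};                              // no overlap: keep a whole
    std::vector<Region> out;
    // slabs of a left/right of the intersection (full y and z extent)
    if (a.x < x0) out.push_back({a.x, a.y, a.z, x0 - a.x, a.h, a.d});
    if (x1 < a.x + a.w) out.push_back({x1, a.y, a.z, a.x + a.w - x1, a.h, a.d});
    // within [x0,x1): slabs below/above the intersection (full z extent)
    if (a.y < y0) out.push_back({x0, a.y, a.z, x1 - x0, y0 - a.y, a.d});
    if (y1 < a.y + a.h) out.push_back({x0, y1, a.z, x1 - x0, a.y + a.h - y1, a.d});
    // within [x0,x1) x [y0,y1): slabs in front of/behind the intersection
    if (a.z < z0) out.push_back({x0, y0, a.z, x1 - x0, y1 - y0, z0 - a.z});
    if (z1 < a.z + a.d) out.push_back({x0, y0, z1, x1 - x0, y1 - y0, a.z + a.d - z1});
    return out;
}

addRegion could then subtract the incoming region from each existing region (or vice versa) before pushing, so the list stays overlap-free.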
You might be able to make use of an r-tree or r-tree variant, which is designed for indexing multidimensional data and has support for a fast intersection test; and given the size of your dataset, you might instead want to use a spatial database.
I'm looking to optimise generating Buddhabrots, and to that end I've read about SIMD and parallel computing. Is it possible to use these to speed up the generation of my Buddhabrots? I'm programming in C.
Yes, Buddhabrot generation can be easily parallelized. The key is to separate the computation from the rendering.
The computation begins with a 2D array of counters, one per pixel, initialized to all zeros. A processor then increments those counters while computing random trajectories. You can parallelize this by having multiple processors each do the same thing, starting from different random seeds and periodically dumping their arrays into files. When you think they may have done enough for a satisfying result, you simply gather all those files and create a master array that contains the sums of all the others. Only then would you perform histogram equalization on the final array and render the result by assigning colors to each range of values in the histogram. If you find that the result is not "cooked" to your satisfaction, you can simply continue the calculations or create more files to be summed and rendered.
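To illustrate the structure (a sketch, not anyone's exact code; written with C++ threads for brevity, though the same shape works with pthreads or OpenMP in C; the file-dumping step is skipped and everything is summed in memory):

#include <algorithm>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

constexpr int W = 512, H = 512, MAX_ITER = 1000;   // made-up parameters

// Each worker owns a private counter array and a private RNG seed.
void worker(uint32_t seed, long samples, std::vector<uint32_t> &counts)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> re(-2.0, 1.0), im(-1.5, 1.5);
    std::vector<std::complex<double>> orbit;
    for (long s = 0; s < samples; ++s) {
        std::complex<double> c(re(rng), im(rng)), z(0.0, 0.0);
        orbit.clear();
        int i = 0;
        for (; i < MAX_ITER && std::norm(z) <= 4.0; ++i) {
            z = z * z + c;
            orbit.push_back(z);
        }
        if (i == MAX_ITER) continue;               // never escaped: not counted
        for (const auto &p : orbit) {              // replay the escaping orbit
            int x = (int)((p.real() + 2.0) / 3.0 * W);
            int y = (int)((p.imag() + 1.5) / 3.0 * H);
            if (x >= 0 && x < W && y >= 0 && y < H)
                counts[(std::size_t)y * W + x]++;
        }
    }
}

int main()
{
    const unsigned T = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::vector<uint32_t>> partial(
        T, std::vector<uint32_t>((std::size_t)W * H, 0));
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < T; ++t)               // different seed per worker
        pool.emplace_back(worker, 1234u + t, 100000L, std::ref(partial[t]));
    for (auto &th : pool) th.join();

    std::vector<uint64_t> total((std::size_t)W * H, 0);  // the master array
    for (const auto &p : partial)
        for (std::size_t i = 0; i < p.size(); ++i) total[i] += p[i];
    // histogram equalization and coloring would happen here, on `total`
}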
Indeed, many have worked on this. This is an example that works pretty well; there are others.