I have several processes, each of which calculates certain sub-matrices of one global matrix. The problem is that the sub-matrices overlap, and in general they do not have to form a contiguous block within the global matrix. Each task might also have more than one sub-matrix.
Finally, in order to obtain my final matrix I need to perform an element wise summation of these sub-matrices by considering the position within the global matrix.
So far I am doing the following:
each processor has its own copy of the global array (matrix)
each processor then calculates a sub-matrix of that global matrix and adds the elements to the right position in the local copy of the global array
with mpi_allreduce I am obtaining the final global matrix synchronized over all the tasks (this is my element wise summation to obtain my final result)
This works reasonably well as long as my global matrix is small. However, it quickly becomes a memory bottleneck, as allocating local copies of the global matrix becomes more and more expensive.
One constraint is that I have to solve this with MPI only.
Another constraint is that I need to perform operations on that global matrix afterwards, where different tasks access (this time read-only) different parts of that global matrix. These blocks are not the same as the sub-matrix blocks before.
I stumbled upon MPI-3 shared memory arrays. However, I am not sure whether this is the best solution for my problem, as several processes have to add small, overlapping local arrays simultaneously. On the other hand, for my operations afterwards, each process could read from that global matrix again.
I am relatively inexperienced in solving these kinds of problems and would be happy for any suggestions.
Thanks!
I have to program an optimized multi-threaded implementation of the Levenshtein distance problem. It can be computed using dynamic programming with a matrix; the Wikipedia page on Levenshtein distance covers that well enough.
Now, I can compute diagonal elements concurrently. That is all alright.
My problem now comes with caches. Matrices in C++ are normally stored in memory row by row, correct? Well, that is not good for me, as I need 2 elements of the previous row and 1 element of the current row to compute my result, which is horrible cache-wise. The cache will hold the current row (or part of it), then I ask for the previous one, which it will probably not hold anymore.
Then for the next element I need a different part of the diagonal, so yet again I ask for completely different rows, and the cache will not have those ready for me.
Therefore, I would like to store my matrix in memory in blocks or maybe diagonals. That would result in fewer cache misses and make my implementation faster again.
How do you do that? I tried searching the internet, but I could never find anything that would show me the way. Is it possible to tell C++ how to order that type in memory?
EDIT: Some of you seem confused about the nature of my question. I want to store a matrix (it does not matter whether I make it a 2D array or something else) in MEMORY using a custom layout. Normally, a 2D array is stored row after row. I need to work with diagonals, so the cache will miss a lot on the huge matrices I will work with (possibly millions of rows and columns).
I believe you may have a misperception of how the (CPU) cache works.
It's true that CPU caching is linear - that is, if you access an address in memory, it will bring into the cache some previous and some successive memory locations - which is like "guessing" that subsequent accesses will involve 1-dimensional-close elements. However, this is true on the micro-level. A CPU's cache is made up of a large number of small "lines" (64 Bytes on all cache levels in recent Intel CPUs). The locality is limited to the line; different cache lines can come from completely different places in memory.
Thus, if you "need two elements of the previous row and one element of the current row" of your matrix, then the cache should work very well for you: Some of the cache will hold elements of the previous row, and some will hold elements of the current row. And when you advance to the next element, the cache overall will usually contain the matrix elements you need to access. Just make sure your order of iteration agrees with the order of progression within the cache line.
Also, in some cases you could be faced with a situation where different threads are thrashing the same cache lines due to the mapping from main memory into the cache. Without getting into details, that is something you need to think about (but again, has nothing to do with 2D vs 1D data).
Edit: As geza notes, if your matrix's rows are long, you will still be reading each memory location twice with the straightforward approach: once as part of the current row, then again as part of the previous row, since each value will be evicted from the cache before it is used as a previous-row value. If you want to avoid this, you can iterate over tiles of your matrix whose size (length x width x sizeof(element)) fits into the L1 cache (along with whatever else needs to be there). You can also consider storing your data in tiles, but I don't think that would be too useful.
Preliminary comment: "Levenshtein distance" is edit distance (under the common definition). This is a very common problem; you probably don't even need to bother writing a solution yourself. Look for existing code.
Now, finally, for a proper answer... You don't actually need to have a matrix at all, and you certainly don't need to "save" it: it's enough to keep merely a "front" of your dynamic programming matrix rather than the whole thing.
But what "front" should you choose, and how do you advance it? I suggest you use anti-diagonals as your front, and given each anti-diagonal, compute the next anti-diagonal concurrently. Thus it'll be {(0,0)}, then {(0,1),(1,0)}, then {(0,2),(1,1),(2,0)} and so on. Each anti-diagonal requires at most two earlier anti-diagonals - and if we keep the values of each anti-diagonal consecutively in memory, then the access pattern while moving along the next anti-diagonal is a linear progression along the previous anti-diagonals - which is great for the cache (see my other answer).
So, you'll "concurrentize" the computation by giving each thread a bunch of consecutive anti-diagonal elements to compute; that should do the trick. And at any time you will only keep 3 anti-diagonals in memory: the one you're working on and the two previous ones. You can cycle between three such buffers so you don't re-allocate memory all the time (but then make sure to pre-allocate buffers with the maximum anti-diagonal length).
This whole thing should work basically the same for the non-square case.
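To make the three-buffer idea concrete, here is a minimal single-threaded sketch in C++. The function name `levenshtein_antidiagonal` is made up for illustration. As a simplification it indexes each anti-diagonal by the row index i, so every buffer has length m + 1 rather than the exact maximum anti-diagonal length. The inner loop has no dependencies between its iterations, so that is the loop you would split among threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Edit distance computed anti-diagonal by anti-diagonal (d = i + j).
// Only three diagonals are kept; cell (i, j) of a diagonal lives at index i.
std::size_t levenshtein_antidiagonal(const std::string& a, const std::string& b) {
    const std::size_t m = a.size();
    const std::size_t n = b.size();
    std::vector<std::size_t> prev2(m + 1), prev1(m + 1), cur(m + 1);

    for (std::size_t d = 0; d <= m + n; ++d) {
        const std::size_t i_lo = d > n ? d - n : 0;   // keeps j = d - i <= n
        const std::size_t i_hi = std::min(d, m);      // keeps i <= m and j >= 0
        // Iterations of this loop are independent of each other.
        for (std::size_t i = i_lo; i <= i_hi; ++i) {
            const std::size_t j = d - i;
            if (i == 0) {
                cur[i] = j;                           // first row: j insertions
            } else if (j == 0) {
                cur[i] = i;                           // first column: i deletions
            } else {
                const std::size_t cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
                const std::size_t substitute = prev2[i - 1] + cost; // cell (i-1, j-1)
                const std::size_t remove = prev1[i - 1] + 1;        // cell (i-1, j)
                const std::size_t insert = prev1[i] + 1;            // cell (i, j-1)
                cur[i] = std::min({substitute, remove, insert});
            }
        }
        // Rotate the buffers instead of reallocating.
        std::swap(prev2, prev1);
        std::swap(prev1, cur);
    }
    return prev1[m];  // the last diagonal holds only cell (m, n)
}
```

Calling `levenshtein_antidiagonal("kitten", "sitting")` gives 3. Memory use is three buffers of m + 1 entries, regardless of n.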
I'm not absolutely sure, but I think a matrix is stored as one long array, one row after the other, and is mapped to a matrix with pointer arithmetic, so you always refer to the same base address and calculate the offset in memory where your value is located.
Otherwise you can easily implement it as your own type and implement operator()(int, int) for your matrix (or a two-argument operator[], which is allowed since C++23).
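A minimal sketch of such a type, assuming a row-major layout over one contiguous array. The class name `Matrix` and its members are made up for illustration. C++ before C++23 does not allow `operator[]` with two arguments, so the sketch uses `operator()` instead. Changing the index formula in one place is all it would take to try a different memory layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix wrapper over one contiguous buffer.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    // Element (r, c) lives at offset r * cols + c in the flat buffer.
    double& operator()(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }
    const double& operator()(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    const double* data() const { return data_.data(); }  // raw view of the layout

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;  // value-initialized to 0.0
};
```

With `Matrix m(2, 3); m(1, 2) = 7.0;` the value ends up at flat offset 1 * 3 + 2 = 5.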
I have this operation which is called multiple times:
longRowVector;
matrix = reshape(longRowVector, n, n)';
answer = matrix(:);
This operation using reshape is slow. Is there a way to get to answer without using reshape?
There is no easy way to speed that up. If n exceeds a certain number (determined by your relevant cache size), the order in which memory is accessed during the transpose becomes cache-unfriendly. The cost is actually incurred in the transpose operation. Plotting this cost for different matrix sizes shows a jump at around n = 360, which is consistent with the cache size on my processor.
If you want to avoid this hit, then you need to create your "cache-optimized" reordering strategy, i.e. perform the reordering in m*m tiles where both of the vectors will fit in the cache.
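The question is about MATLAB, but the tiling idea is easier to show in a language with explicit loops, so here is a rough C++ sketch. The function name `transpose_tiled` and the default tile size of 32 are assumptions; the right tile size depends on your cache and has to be measured. It treats the input as an n x n row-major matrix stored in a flat vector and writes the transpose tile by tile, so that the source tile and the destination tile can both stay in cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Cache-blocked transpose of an n x n matrix stored in a flat vector.
// tile is a tuning parameter (assumed value, measure on your machine).
std::vector<double> transpose_tiled(const std::vector<double>& in,
                                    std::size_t n,
                                    std::size_t tile = 32) {
    assert(in.size() == n * n);
    assert(tile > 0);
    std::vector<double> out(n * n);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t jj = 0; jj < n; jj += tile) {
            const std::size_t i_end = std::min(ii + tile, n);
            const std::size_t j_end = std::min(jj + tile, n);
            // Within one tile, both the reads and the writes stay in a small region.
            for (std::size_t i = ii; i < i_end; ++i) {
                for (std::size_t j = jj; j < j_end; ++j) {
                    out[j * n + i] = in[i * n + j];
                }
            }
        }
    }
    return out;
}
```

For n = 3 and input 0..8 this returns 0, 3, 6, 1, 4, 7, 2, 5, 8, which is the same reordering the reshape-and-transpose expression produces.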
In summary, I'm looking for ways to deal with a situation where the very first step in the calculation is a conditional branch between two computationally expensive branches.
I'm essentially trying to implement a graphics filter that operates on an image and a mask - the mask is a bitmap array the same size as the image, and the filter performs different operations according to the value of the mask. So I basically want to do something like this for each pixel:
if(mask == 1) {
    foo();
} else {
    bar();
}
where both foo and bar are fairly expensive operations. As I understand it, when I run this code on the GPU it will have to calculate both branches for every pixel. (This gets even more expensive if there are more than two possible values for the mask.) Is there any way to avoid this?
One option I can think of would be to sort all the pixels, in the host code, into two 1-dimensional arrays based on the value of the mask at that point, then run entirely different kernels on them, and then reconstruct the image from the two datasets afterwards. The problem with this is that, in my case, I want to run the filter iteratively, and both the image and the mask change with each iteration (the mask is actually calculated from the image). If I'm splitting the image into two buckets in the host code, I have to transfer each iteration of the image and mask from the GPU, and then the new buckets back to the GPU, introducing a new bottleneck to replace the old one.
Is there any other way to avoid this bottleneck?
Another approach might be to do a simple bucket sort within each work-group using the mask.
So add a local memory array and atomic counter for each value of mask. First read a pixel (or set of pixels might be better) for each work item, increment the appropriate atomic count and write the pixel address into that location in the array.
Then perform a work-group barrier.
Then as a final stage assign some set of work-items, maybe a multiple of the underlying vector size, to each of those arrays and iterate through it. Your operations will then be largely efficient, barring some loss at the ends, and if you look at enough pixels per work-item you may have very little loss of efficiency even if you assign the entire group to one mask value and then the other in turn.
Given that your description only has two values of the mask, fitting two arrays into local memory should be pretty simple and scale well.
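To show the shape of this idea without any GPU API, here is a plain C++ sketch that plays the role of one work-group on the CPU. It is only an analogy, not OpenCL kernel code: the two vectors stand in for the local-memory arrays, `push_back` stands in for the atomic counter increment, and `foo` and `bar` are trivial placeholders for the expensive operations from the question. The function name `filter_group` is made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder stand-ins for the two expensive per-pixel operations.
inline float foo(float p) { return p * 2.0f; }
inline float bar(float p) { return p + 1.0f; }

// Processes pixels in [begin, end), the range one work-group would own.
void filter_group(std::vector<float>& pixels,
                  const std::vector<std::uint8_t>& mask,
                  std::size_t begin, std::size_t end) {
    // Stage 1: bucket pixel indices by mask value.
    // On a GPU: local arrays plus one atomic counter per mask value.
    std::vector<std::size_t> foo_bucket, bar_bucket;
    for (std::size_t i = begin; i < end; ++i) {
        (mask[i] == 1 ? foo_bucket : bar_bucket).push_back(i);
    }

    // (On a GPU, a work-group barrier goes here.)

    // Stage 2: each bucket is now branch-free, so neighbouring work-items
    // all execute the same operation.
    for (std::size_t idx : foo_bucket) pixels[idx] = foo(pixels[idx]);
    for (std::size_t idx : bar_bucket) pixels[idx] = bar(pixels[idx]);
}
```

The point of the two stages is that the divergence is paid once, in the cheap bucketing step, instead of inside the expensive operations.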
Push the demanding task of a thread to shared/local memory (synchronization slows the process down) and execute the light tasks until all of them finish (so the slow sync latency is hidden behind them), then execute the heavier ones.
if(mask == 1) {
    uploadFoo(); // heavy: upload to __local object[]
} else {
    processBar(); // light: compute until done, then check local memory for any pending foo() work
    downloadFoo();
}
Maybe use a producer-consumer approach.
I have a 2D matrix where I want to modify every value by applying a function that is only dependent on the coordinates in the matrix and values set at compile-time. Since no synchronization is necessary between each such calculation, it seems to me like the work group size could really be 1, and the number of work groups equal to the number of elements in the matrix.
My question is whether this will actually yield the desired result, or whether other forces are at play here that might make a different setting for these values better?
My recommendation: just set the global size to your 2D matrix size, and the local size to NULL. This lets the OpenCL implementation select a suitable local size for you.
In your specific case, the local size does not need to have any particular shape. In fact, any valid value will do the job, but the performance may differ. You can tune it manually for different hardware, but it is easier to let the implementation do this job for you, and it is also more portable.
I'm looking to optimise generating buddhabrots, and to do so I read about SIMD and parallel computing. Is it possible to use this to speed up the generation of my buddhabrots? I'm programming in C.
Yes, Buddhabrot generation can be easily parallelized. The key is to separate the computation from the rendering. The computation begins with a 2D array of counters, one per pixel, initialized to all zeros. A processor can then increment those counters while computing random trajectories. You can parallelize this in SIMD fashion by having multiple processors each doing this starting with different random seeds and periodically dumping those arrays into files. When you think they may have done this enough for a satisfying result, you simply gather all those files and create a master array that contains the sums of all the others. Only then would you perform histogram equalization on the final array and render the result by assigning colors to each range of values in the histogram. If you find that the result is not "cooked" to your satisfaction, you can simply continue the calculations or create more files to be summed and rendered.
Indeed, many have worked on this. This is an example that works pretty well. There are others.
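As a rough sketch of the compute-then-sum scheme described above (in C++ rather than C, and keeping the arrays in memory instead of dumping them to files): each worker gets its own seed and its own counter array, and the master array is just the element-wise sum. The names `buddhabrot_worker` and `sum_histograms`, the sampling square [-2, 2] x [-2, 2], and all parameters are assumptions for illustration. The workers share nothing, so each call could run in its own thread or process.

```cpp
#include <complex>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Histogram = std::vector<std::uint32_t>;

// One independent worker: samples random c values and counts the pixels
// visited by escaping trajectories of z -> z*z + c.
Histogram buddhabrot_worker(unsigned seed, int samples, int size, int max_iter) {
    Histogram counts(static_cast<std::size_t>(size) * size, 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(-2.0, 2.0);
    std::vector<std::complex<double>> path;

    for (int s = 0; s < samples; ++s) {
        const std::complex<double> c(dist(rng), dist(rng));
        std::complex<double> z(0.0, 0.0);
        path.clear();
        bool escaped = false;
        for (int k = 0; k < max_iter; ++k) {
            z = z * z + c;
            path.push_back(z);
            if (std::norm(z) > 4.0) { escaped = true; break; }  // |z| > 2
        }
        if (!escaped) continue;  // only escaping trajectories are counted
        for (const auto& p : path) {
            // Map [-2, 2) onto [0, size) and skip points outside the image.
            const double fx = (p.real() + 2.0) / 4.0 * size;
            const double fy = (p.imag() + 2.0) / 4.0 * size;
            if (fx < 0.0 || fx >= size || fy < 0.0 || fy >= size) continue;
            ++counts[static_cast<std::size_t>(fy) * size + static_cast<std::size_t>(fx)];
        }
    }
    return counts;
}

// Master array: element-wise sum of all partial arrays (assumed equal size).
Histogram sum_histograms(const std::vector<Histogram>& parts) {
    Histogram total(parts.empty() ? 0 : parts.front().size(), 0);
    for (const auto& part : parts) {
        for (std::size_t i = 0; i < total.size(); ++i) total[i] += part[i];
    }
    return total;
}
```

Histogram equalization and colouring would then be applied to the summed array only, as described above. If the result is not "cooked" enough, run more workers with new seeds and add their arrays to the sum.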