Speed up reshape / avoid reshape in MATLAB - performance

I have this operation which is called multiple times:
longRowVector;
matrix = reshape(longRowVector, n, n)';
answer = matrix(:);
This operation using reshape is slow. Is there a way to get to answer without using reshape?

There is no easy way to speed that up. If n exceeds a certain size (determined by your relevant cache size), the memory accesses made while performing the transpose become cache-unfriendly; the cost is actually incurred in the transpose operation, not in the reshape. Below I plot this cost for different matrix sizes. There is a jump at around n = 360, which is consistent with the cache size on my processor.
If you want to avoid this hit, then you need to create your own "cache-optimized" reordering strategy, i.e. perform the reordering in m-by-m tiles, where both the source and destination tiles fit in the cache.
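For illustration, here is roughly what such a tiled reordering could look like written in C++ (e.g. as the core of a MEX routine); this is an untested sketch, and TILE is a tuning parameter you would choose so that a source and a destination tile both fit in cache:

#include <cstddef>

// Cache-blocked "transpose" of an n-by-n matrix stored in a flat array:
// out[i*n + j] = in[j*n + i].
constexpr std::size_t TILE = 64;   // tuning parameter

void tiledReorder(const double* in, double* out, std::size_t n)
{
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t jj = 0; jj < n; jj += TILE)
            // Sweep one TILE-by-TILE block so both blocks stay cache-resident.
            for (std::size_t i = ii; i < ii + TILE && i < n; ++i)
                for (std::size_t j = jj; j < jj + TILE && j < n; ++j)
                    out[i * n + j] = in[j * n + i];
}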

Related

Is MPI shared memory a good solution for my problem?

I have several processes, each of which calculates certain sub-matrices of one global matrix. The problem is that the sub-matrices overlap, and in general they do not necessarily form a contiguous block within the global matrix. Also, each task might have more than one sub-matrix.
Finally, in order to obtain my final matrix I need to perform an element-wise summation of these sub-matrices, taking their positions within the global matrix into account.
So far I am doing the following:
each processor has its own copy of the global array (matrix)
each processor then calculates a sub-matrix of that global matrix and adds the elements to the right position in the local copy of the global array
with mpi_allreduce I obtain the final global matrix, synchronized over all tasks (this is my element-wise summation to obtain my final result; a minimal sketch of this step is shown below)
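A stripped-down sketch of that summation step (the matrix size here is just a placeholder):

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    const int n = 1000;                       // placeholder size
    std::vector<double> global(n * n, 0.0);   // every rank holds a full copy

    // ... each rank adds its sub-matrices into `global` at the right offsets ...

    // Element-wise sum of all local copies; afterwards every rank holds the
    // complete global matrix.  This is the step that scales poorly in memory.
    MPI_Allreduce(MPI_IN_PLACE, global.data(), n * n,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
}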
This works reasonably well as long as my global matrix is small. However, it quickly becomes a memory bottleneck, as allocating local copies of the global matrix becomes more and more expensive.
One constraint is that I have to solve this with MPI only.
Another constraint is that I need to perform operations on that global matrix afterwards, where different tasks have (this time read-only) access to different parts of it. These blocks are not the same as the sub-matrix blocks from before.
I somehow stumbled upon MPI-3 shared-memory arrays. However, I am not sure if this is the best solution for my problem, since several processes have to simultaneously add small local arrays that overlap. For my operations afterwards, however, each process could also read from that global matrix again.
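For concreteness, the kind of allocation I have in mind would look roughly like this (untested sketch; how the concurrent additions get synchronized is exactly the part I am unsure about):

#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // One shared window per node: rank 0 of the node allocates the full
    // array, the other ranks attach with size 0 and query the base pointer.
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int noderank;
    MPI_Comm_rank(nodecomm, &noderank);

    const long n = 1000;                                     // placeholder size
    MPI_Aint bytes = (noderank == 0) ? n * n * sizeof(double) : 0;

    double* base = nullptr;
    MPI_Win win;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                            nodecomm, &base, &win);

    if (noderank != 0) {                                     // attach to rank 0's block
        MPI_Aint qsize;
        int qdisp;
        MPI_Win_shared_query(win, 0, &qsize, &qdisp, &base);
    }

    // base[0 .. n*n-1] is now the node-shared matrix; the concurrent
    // element-wise additions would still need some synchronization scheme.

    MPI_Win_free(&win);
    MPI_Finalize();
}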
I am relatively inexperienced with this kind of problem and I would be happy about any kind of suggestion.
Thanks!

How to save a matrix in C++ in a non-linear way

I have to program an optimized multi-threaded implementation of the Levenshtein distance problem. It can be computed using dynamic programming with a matrix; the Wikipedia page on Levenshtein distance covers that well enough.
Now, I can compute diagonal elements concurrently. That is all alright.
My problem now comes with caches. Matrices in C++ are normally stored in memory row by row, correct? Well, that is not good for me, as I need 2 elements of the previous row and 1 element of the current row to compute my result; that is horrible cache-wise. The cache will hold the current row (or part of it), then I ask for the previous one, which it will probably not hold anymore.
Then for another element I need a different part of the diagonal, so yet again I ask for completely different rows and the cache will not have those ready for me.
Therefore, I would like to store my matrix in memory in blocks, or maybe in diagonals. That will result in fewer cache misses and make my implementation faster again.
How do you do that? I tried searching the internet, but I could never find anything that would show me the way. Is it possible to tell C++ how to order that type in memory?
EDIT: Some of you seem confused about the nature of my question. I want to store a matrix (it does not matter whether I make it a 2D array or anything else) in a custom way in MEMORY. Normally, a 2D array is stored row after row; I need to work with diagonals, so caches will miss a lot on the huge matrices I will work with (possibly millions of rows and columns).
I believe you may have a misconception about the (CPU) cache.
It's true that CPU caching is linear - that is, if you access an address in memory, it will bring into the cache some previous and some successive memory locations - which is like "guessing" that subsequent accesses will involve 1-dimensional-close elements. However, this is true on the micro-level. A CPU's cache is made up of a large number of small "lines" (64 Bytes on all cache levels in recent Intel CPUs). The locality is limited to the line; different cache lines can come from completely different places in memory.
Thus, if you "need two elements of the previous row and one element of the current row" of your matrix, then the cache should work very well for you: Some of the cache will hold elements of the previous row, and some will hold elements of the current row. And when you advance to the next element, the cache overall will usually contain the matrix elements you need to access. Just make sure your order of iteration agrees with the order of progression within the cache line.
Also, in some cases you could be faced with a situation where different threads are thrashing the same cache lines due to the mapping from main memory into the cache. Without getting into details, that is something you need to think about (but again, has nothing to do with 2D vs 1D data).
Edit: As geza notes, if your matrix's rows are long, you will still be reading each memory location twice with the straightforward approach: once as the current row and again later as the previous row, since each value will have been evicted from the cache before it's used as a previous-row value. If you want to avoid this, you can iterate over tiles of your matrix whose size (length x width x sizeof(element)) fits into the L1 cache (along with whatever else needs to be there). You can also consider storing your data in tiles, but I don't think that would be very useful.
Preliminary comment: "Levenshtein distance" is edit distance (under the common definition). This is a very common problem; you probably don't even need to bother writing a solution yourself. Look for existing code.
Now, finally, for a proper answer... You don't actually need to have a matrix at all, and you certainly don't need to "save" it: it's enough to keep merely a "front" of your dynamic programming matrix rather than the whole thing.
But what "front" should you choose, and how do you advance it? I suggest you use anti-diagonals as your front and, given each anti-diagonal, concurrently compute the next one. Thus it'll be {(0,0)}, then {(0,1),(1,0)}, then {(0,2),(1,1),(2,0)} and so on. Each anti-diagonal requires at most the two earlier anti-diagonals - and if we keep the values of each anti-diagonal consecutive in memory, then the access pattern when computing the next anti-diagonal is a linear progression along the previous ones - which is great for the cache (see my other answer).
So, you'll "concurrentize" the computation by giving each thread a bunch of consecutive anti-diagonal elements to compute; that should do the trick. At any time you will only keep three anti-diagonals in memory: the one you're working on and the two previous ones. You can cycle between three such buffers so you don't re-allocate memory all the time (but make sure to pre-allocate buffers with the maximum anti-diagonal length).
This whole thing should work basically the same for the non-square case.
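Here is a rough, single-threaded sketch of that front in C++ (my own variable names; the loop over one anti-diagonal is the part you would split across threads):

#include <algorithm>
#include <string>
#include <vector>

// Levenshtein distance computed along anti-diagonals, keeping only the last
// three diagonals in memory.  Diagonal d holds the cells (i, j) with i + j = d.
int levenshtein(const std::string& a, const std::string& b)
{
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());

    std::vector<int> prev2, prev1, cur;   // diagonals d-2, d-1 and d

    for (int d = 0; d <= m + n; ++d) {
        int ilo = std::max(0, d - n);     // valid row range on this diagonal
        int ihi = std::min(d, m);
        cur.assign(ihi - ilo + 1, 0);

        for (int i = ilo; i <= ihi; ++i) {       // <-- parallelize this loop
            int j = d - i;
            int& c = cur[i - ilo];
            if (i == 0)      c = j;              // first row of the DP matrix
            else if (j == 0) c = i;              // first column
            else {
                int off1 = std::max(0, (d - 1) - n);   // start row of diagonal d-1
                int off2 = std::max(0, (d - 2) - n);   // start row of diagonal d-2
                int up   = prev1[(i - 1) - off1];      // D[i-1][j]
                int left = prev1[i - off1];            // D[i][j-1]
                int diag = prev2[(i - 1) - off2];      // D[i-1][j-1]
                c = std::min({ up + 1, left + 1,
                               diag + (a[i - 1] == b[j - 1] ? 0 : 1) });
            }
        }
        prev2 = std::move(prev1);
        prev1 = std::move(cur);
    }
    return prev1[0];   // diagonal m + n contains only D[m][n]
}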
I'm not absolutely sure, but I think a matrix is stored as one long array, one row after the other, and is mapped onto a matrix with pointer arithmetic, so you always refer to the same base address and compute the offset in memory where your value is located.
Otherwise you can easily implement it yourself as such a flat type and provide a two-index element accessor (e.g. operator()(int, int)) for your matrix that maps the indices to whatever layout you choose, as sketched below.
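For example, something roughly like this (an illustrative, untested sketch; the anti-diagonal layout is just one possible mapping):

#include <algorithm>
#include <cstddef>
#include <vector>

// A matrix class that owns one flat buffer and maps (row, col) onto a custom
// memory layout - here anti-diagonal-major, so cells of one anti-diagonal are
// adjacent in memory.  The prefix sums in index() could be precomputed.
template <typename T>
class DiagonalMatrix {
public:
    DiagonalMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    T& operator()(std::size_t i, std::size_t j)             { return data_[index(i, j)]; }
    const T& operator()(std::size_t i, std::size_t j) const { return data_[index(i, j)]; }

private:
    std::size_t startRow(std::size_t d) const {               // first row on diagonal d
        return d >= cols_ ? d - cols_ + 1 : 0;
    }
    std::size_t diagLength(std::size_t d) const {             // number of cells on diagonal d
        return std::min(d, rows_ - 1) - startRow(d) + 1;
    }
    std::size_t index(std::size_t i, std::size_t j) const {   // flat position of (i, j)
        std::size_t d = i + j, before = 0;
        for (std::size_t k = 0; k < d; ++k)
            before += diagLength(k);
        return before + (i - startRow(d));
    }

    std::size_t rows_, cols_;
    std::vector<T> data_;
};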

Performing different tasks for different data items in OpenCL?

In summary, I'm looking for ways to deal with a situation where the very first step in the calculation is a conditional branch between two computationally expensive branches.
I'm essentially trying to implement a graphics filter that operates on an image and a mask - the mask is a bitmap array the same size as the image, and the filter performs different operations according to the value of the mask. So I basically want to do something like this for each pixel:
if(mask == 1) {
    foo();
} else {
    bar();
}
where both foo and bar are fairly expensive operations. As I understand it, when I run this code on the GPU it will have to calculate both branches for every pixel. (This gets even more expensive if there are more than two possible values for the mask.) Is there any way to avoid this?
One option I can think of would be to sort all the pixels into two 1-dimensional arrays in the host code, based on the value of the mask at each point, then run entirely different kernels on them, and reconstruct the image from the two datasets afterwards. The problem with this is that, in my case, I want to run the filter iteratively, and both the image and the mask change with each iteration (the mask is actually calculated from the image). If I split the image into two buckets in the host code, I have to transfer the image and mask from the GPU on each iteration, and then the new buckets back to the GPU, introducing a new bottleneck to replace the old one.
Is there any other way to avoid this bottleneck?
Another approach might be to do a simple bucket sort within each work-group using the mask.
So add a local memory array and an atomic counter for each value of the mask. First read a pixel (or a set of pixels might be better) for each work-item, increment the appropriate atomic counter and write the pixel address into that location in the array.
Then perform a work-group barrier.
Then, as a final stage, assign some set of work-items, maybe a multiple of the underlying vector size, to each of those arrays and iterate through it. Your operations will then be largely efficient, barring some loss at the ends; and if you process enough pixels per work-item, you may have very little loss of efficiency even if you assign the entire group to one mask value and then the other in turn.
Given that your description only has two values of the mask, fitting two arrays into local memory should be pretty simple and scale well.
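A rough kernel sketch of that idea (foo() and bar() are placeholders for the real per-pixel operations, GROUP_SIZE must match the actual work-group size, and the global size is assumed to be a multiple of it):

// Bucket pixel indices by mask value in local memory, then let the whole
// work-group walk one bucket after the other, so neighbouring work-items
// always take the same branch.
#define GROUP_SIZE 256

float foo(float p) { return p * 2.0f; }   // placeholder for the expensive op
float bar(float p) { return p * 0.5f; }   // placeholder for the other op

__kernel void masked_filter(__global const float *img,
                            __global const uchar *mask,
                            __global float *out)
{
    __local int fooIdx[GROUP_SIZE];
    __local int barIdx[GROUP_SIZE];
    __local int fooCount, barCount;

    int lid = get_local_id(0);
    int gid = get_global_id(0);

    if (lid == 0) { fooCount = 0; barCount = 0; }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Phase 1: each work-item files its pixel index under its mask value.
    if (mask[gid] == 1)
        fooIdx[atomic_inc(&fooCount)] = gid;
    else
        barIdx[atomic_inc(&barCount)] = gid;

    barrier(CLK_LOCAL_MEM_FENCE);

    // Phase 2: process each bucket with the whole work-group.
    for (int i = lid; i < fooCount; i += (int)get_local_size(0))
        out[fooIdx[i]] = foo(img[fooIdx[i]]);
    for (int i = lid; i < barCount; i += (int)get_local_size(0))
        out[barIdx[i]] = bar(img[barIdx[i]]);
}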
Push the demanding task of a work-item into shared/local memory (synchronization slows things down) and execute the light ones until all light ones have finished (so the slow synchronization latency is hidden), then execute the heavier ones:
if(mask == 1) {
    uploadFoo();    // heavy: push this work-item's foo() input into a __local array
} else {
    processBar();   // do the light work first, then check __local memory for queued foo() work
    downloadFoo();  // pull a queued foo() item from __local memory and process it
}
Using a producer-consumer approach, maybe.

Determine offset where the most constructive interference occurs

I have two arrays of data:
I would like to align these similar graphs together (by adding an offset to either array):
Essentially what I want is the most constructive interference, as shown when two waves together produce the same wave but with larger amplitude:
This is also the same as finding the most destructive interference, but one of the arrays must be inverted as shown:
Notice that the second wave is inverted (peaks become troughs / vice-versa).
The actual data will not only consist of one major and one minor peak and trough, but of many, and there might not be any noticeable spikes. I have made the data in the diagram simpler to show how I would like the data aligned.
I was thinking about a few loops, such as:
biggest = 0
loop from -10 to 10 as offset
count = 0
loop through array1 as ar1
loop through array2 as ar2
count += array1[ar1] + array2[ar2 - offset]
replace biggest with count if count/sizeof(array1) > biggest
However, that requires looping through every offset and looping through both arrays. My real arrays are extremely large, and this would take too long.
How would I go about determining the offset required to match data1 with data2?
JSFiddle (note that this is language-agnostic and I would like to understand the algorithm more so than the actual code)
Look at convolution and cross-correlation and their computation using the Fast Fourier Transform. That is how it is done in real-life applications.
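For reference, the brute-force version of the same idea (my own sketch, in C++) shows what is being maximized; for large arrays you would compute exactly this via the FFT instead of the double loop:

#include <cstddef>
#include <limits>
#include <vector>

// Returns the shift of data2 relative to data1 that maximizes the sum of
// products of overlapping samples (the cross-correlation peak).
// O(n * maxOffset); an FFT-based version does the same job in O(n log n).
int bestOffset(const std::vector<double>& data1,
               const std::vector<double>& data2,
               int maxOffset)
{
    int best = 0;
    double bestScore = -std::numeric_limits<double>::infinity();

    for (int offset = -maxOffset; offset <= maxOffset; ++offset) {
        double score = 0.0;
        for (std::size_t i = 0; i < data1.size(); ++i) {
            long long j = static_cast<long long>(i) + offset;
            if (j >= 0 && j < static_cast<long long>(data2.size()))
                score += data1[i] * data2[static_cast<std::size_t>(j)];
        }
        if (score > bestScore) { bestScore = score; best = offset; }
    }
    return best;   // shift data2 by this amount to line the arrays up
}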
If (and only if) your data has very recognizable spikes, you could do what a human being would do: match the spikes. Fiddle
The important part is the function matchData().
An improved version would search for N max and min spikes, then calculate an average offset.
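A minimal sketch of the single-spike version (my own names, not the Fiddle's matchData(); the N-spike variant would average several such differences):

#include <algorithm>
#include <vector>

// Align the largest peak of each array: the offset is simply the difference
// between the two peak positions.
int spikeOffset(const std::vector<double>& data1,
                const std::vector<double>& data2)
{
    auto peak1 = std::max_element(data1.begin(), data1.end()) - data1.begin();
    auto peak2 = std::max_element(data2.begin(), data2.end()) - data2.begin();
    return static_cast<int>(peak1 - peak2);   // shift data2 by this much
}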

Draw Mandelbrot using SIMD

I'm looking to optimise generating Buddhabrots, and to do so I have read about SIMD and parallel computing. Is it possible to use these to speed up the generation of my Buddhabrots? I'm programming in C.
Yes, Buddhabrot generation can be easily parallelized. The key is to separate the computation from the rendering.
The computation begins with a 2D array of counters, one per pixel, initialized to all zeros. A processor can then increment those counters while computing random trajectories. You can parallelize this in SIMD fashion by having multiple processors each do this, starting with different random seeds and periodically dumping their arrays into files.
When you think they may have done enough for a satisfying result, you simply gather all those files and create a master array that contains the sums of all the others. Only then would you perform histogram equalization on the final array and render the result by assigning colors to each range of values in the histogram. If you find that the result is not "cooked" to your satisfaction, you can simply continue the calculations or create more files to be summed and rendered.
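A rough sketch of that structure, here with threads inside one process and an in-memory sum instead of files (C++/std::thread for brevity; the same layout maps directly to C with pthreads or OpenMP, and the image size, sample count and pixel mapping are placeholders):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

constexpr int WIDTH = 1024, HEIGHT = 1024;       // placeholder image size
constexpr long SAMPLES_PER_WORKER = 1000000;     // placeholder sample count
constexpr int MAX_ITER = 1000;

// Trace one candidate point c = (cr, ci); if its orbit escapes, replay the
// orbit and increment the counter of every pixel it visits.
void traceTrajectory(double cr, double ci, std::vector<std::uint32_t>& counts)
{
    double zr = 0.0, zi = 0.0;
    int it = 0;
    while (it < MAX_ITER && zr * zr + zi * zi < 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++it;
    }
    if (it == MAX_ITER) return;                  // never escaped: not plotted

    zr = 0.0; zi = 0.0;
    for (int k = 0; k < it; ++k) {               // replay the escaping orbit
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        int px = static_cast<int>((zr + 2.0) / 4.0 * WIDTH);
        int py = static_cast<int>((zi + 2.0) / 4.0 * HEIGHT);
        if (px >= 0 && px < WIDTH && py >= 0 && py < HEIGHT)
            ++counts[static_cast<std::size_t>(py) * WIDTH + px];
    }
}

// Each worker owns a private counter grid and its own random seed.
void worker(std::uint32_t seed, std::vector<std::uint32_t>& counts)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(-2.0, 2.0);
    for (long s = 0; s < SAMPLES_PER_WORKER; ++s)
        traceTrajectory(dist(rng), dist(rng), counts);
}

int main()
{
    unsigned nWorkers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::vector<std::uint32_t>> grids(
        nWorkers, std::vector<std::uint32_t>(WIDTH * HEIGHT, 0));

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nWorkers; ++t)
        pool.emplace_back(worker, 1234u + t, std::ref(grids[t]));
    for (auto& th : pool) th.join();

    // Sum the private grids into one master array; histogram equalization
    // and colouring would then run on `master` only.
    std::vector<std::uint64_t> master(WIDTH * HEIGHT, 0);
    for (const auto& g : grids)
        for (std::size_t i = 0; i < g.size(); ++i)
            master[i] += g[i];
}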
Indeed, many have worked on this. This is an example that works pretty well. There are others.
