Draw Mandelbrot using SIMD - parallel-processing

I'm looking to optimise generating buddhabrots and to do so I read about SIMD and parallel computing. Is it possible to use this to speed up the generation of my buddhabrots. I'm programming in C

Yes, Buddhabrot generation can be easily parallelized. The key is to separate the computation from the rendering. The computation begins with a 2D array of counters, one per pixel, initialized to all zeros. A processor can then increment those counters while computing random trajectories. You can parallelize this in SIMD fashion by having multiple processors each doing this starting with different random seeds and periodically dumping those arrays into files. When you think they may have done this enough for a satisfying result, you simply gather all those files and create a master array that contains the sums of all the others. Only then would you perform histogram equalization on the final array and render the result by assigning colors to each range of values in the histogram. If you find that the result is not "cooked" to your satisfaction, you can simply continue the calculations or create more files to be summed and rendered.

Indeed many have worked on this. This an example that works pretty well. There are others.

Related

Is MPI shared memory a good solution for my problem?

I have several processes, each of them calculates certain sub-matrices of one global matrix. The problem is that the sub-matrices will overlap and in general they do not necessarily have to form a continuous block within the global matrix. Also each tasks might also have more than one sub-matrix.
Finally, in order to obtain my final matrix I need to perform an element wise summation of these sub-matrices by considering the position within the global matrix.
So far I am doing the following:
each processor has its own copy of the global array (matrix)
each processor then calculates a sub-matrix of that global matrix and adds the elements to the right position in the local copy of the global array
with mpi_allreduce I am obtaining the final global matrix synchronized over all the tasks (this is my element wise summation to obtain my final result)
This works reasonably well as long as my global matrix is small. However, this becomes quickly a memory bottleneck as allocating local copies of the global matrix becomes more and more expensive.
One constraint is that I have to solve this with MPI only.
Another constraint is that I need to perform operations on that global matrix afterwards. Where different task have access (this time read-only) different parts of that global matrix. The blocks are not the same as the sub-matrix blocks before.
I somehow stumbled along MPI-3 shared memory arrays. However, I am not sure if this might be the best solution for my problem as several processes have to add simultaneously small local arrays which overlap. However, for my operations afterwards, each process could also read from that global matrix again.
I am relatively inexperienced how to solve these kind of problems and I would be happy for any kind of suggestions.
Thanks!

Performing different tasks for different data items in OpenCL?

In summary, I'm looking for ways to deal with a situation where the very first step in the calculation is a conditional branch between two computationally expensive branches.
I'm essentially trying to implement a graphics filter that operates on an image and a mask - the mask is a bitmap array the same size as the image, and the filter performs different operations according to the value of the mask. So I basically want to do something like this for each pixel:
if(mask == 1) {
foo();
} else {
bar();
}
where both foo and bar are fairly expensive operations. As I understand it, when I run this code on the GPU it will have to calculate both branches for every pixel. (This gets even more expensive if there are more than two possible values for the mask.) Is there any way to avoid this?
One option I can think of would be to, in the host code, sort all the pixels into two 1-dimensional arrays based on the value of the mask at that point, and then entirely different kernels on them; then reconstruct the image from the two datasets afterwards. The problem with this is that, in my case, I want to run the filter iteratively, and both the image and the mask change with each iteration (the mask is actually calculated from the image). If I'm splitting the image into two buckets in the host code, I have to transfer each iteration of the image and mask from the GPU, and then the new buckets back to the GPU, introducing a new bottleneck to replace the old one.
Is there any other way to avoid this bottleneck?
Another approach might be to do a simple bucket sort within each work-group using the mask.
So add a local memory array and atomic counter for each value of mask. First read a pixel (or set of pixels might be better) for each work item, increment the appropriate atomic count and write the pixel address into that location in the array.
Then perform a work-group barrier.
Then as a final stage assign some set of work-items, maybe a multiple of the underlying vector size, to each of those arrays and iterate through it. Your operations will then be largely efficient, barring some loss at the ends, and if you look at enough pixels per work-item you may have very little loss of efficiency even if you assign the entire group to one mask value and then the other in turn.
Given that your description only has two values of the mask, fitting two arrays into local memory should be pretty simple and scale well.
Push demanding task of a thread to shared/local memory(synchronization slows the process) and execute light ones untill all light ones finish(so the slow sync latency is hidden by this), then execute heavier ones.
if(mask == 1) {
uploadFoo();//heavy, upload to __local object[]
} else {
processBar(); // compute until, then check for a foo() in local memory if any exists.
downloadFoo();
}
using a producer - consumer approach maybe.

How to sort a small amount(around 100~200) of (~16~bit) numbers on GPU CUDA very very fast?

Hi is part of my project. The key is that I have to sort a array of numbers (say 100~200 16 bits numbers, numbers and bits are fixed before hand). I want to sort this within one block using shared memory of GPU, as part of the requirement the data could not go off-chip in middle.
I have read some radix sorting, bitonic sorting algorithm on GPU. But it looks like they are designed for large amount of numbers. I want to sort this 100~200 numbers very very quickly
I appreciate any idea/help
CUB has routines to do sorting operations within a single block, including examples. CUB also has routines that can run at the device level if you desire, to leverage all of the compute resources for larger problem sizes.
For 100+ numbers to be sorted, you're not likely to be able to write your own code that runs any faster.
If you have a very small group (say < 32) numbers to sort at the block level, you may simply want to write your own code. In that case, the warp vote and shuffle instructions are likely to be useful.

Building a histogram faster

I am working with a large dataset that I need to build a histogram of. I feel like my method of just going through the entire list and marking in a second array the frequency is a slow approach. Any suggestions on how to speed the process up?
Given that a histogram is a graph containing the counts of all items in each bin, you can't make one without visiting all the items.
However, you can:
Create the histogram as you collect the data. Then it takes no time to generate.
Break up the data into N parts, and work on each part in parallel. When each part is done counting, just sum the results for each bin. (You can also combine this with #1)
Sample the data. In theory, looking at a fraction of your data, you should be able to estimate the rest of it. The Math.

Hadoop. Reducing result to the single value

I started learning Hadoop, and am a bit confused by MapReduce. For tasks where result natively is a list of key-value pairs everything seems clear. But I don't understand how should I solve the tasks where result is a single value (say, sum of squared input decimals, or centre of mass for input points).
On the one hand I can put all results of mapper to the same key. But as far as I understood in this case the only reducer will manage the whole set of data (calculate sum, or mean coordinates). It doesn't look like a good solution.
Another one that I can imaging is to group mapper results. Say, mapper that processed examples 0-999 will produce key equals to 0, 1000-1999 will produce key equals to 1, and so on. As far as there still will be multiple results of reducers, it will be necessary to build chain of reducers (reducing will be repeated until only one result remains). It looks much more computational effective, but a bit complicated.
I still hope that Hadoop has the off-the-shelf tool that executes superposition of reducers to maximise the efficiency of reducing the whole data to a single value. Although I failed to find one.
What is the best practise of solving the tasks where result is a single value?
If you are able to reformulate your task in terms of commutative reduce you should look at Combiners. Any way you should take a look on it, it can significantly reduce amount data to shuffle.
From my point of view, you are tackling the problem from the wrong angle.
See that problem where you need to sum the squares of your input, let's assume you have many and large text input files consisting out of a number per line.
Then ideally you want to parallelize your sums in the mapper and then just sum up the sums in the reducer.
e.G:
map: (input "x", temporary sum "s") -> s+=(x*x)
At the end of map, you would emit that temporary sum of every mapper with a global key.
In the reduce stage, you basically get all the sums from your mappers and sum the sums up, note that this is fairly small (n-times a single integer, where n is the number of mappers) in relation to your huge input files and therefore a single reducer is really not a scalability bottleneck.
You want to cut down the communication cost between the mapper and the reducer, not proxy all your data to a single reducer and read through it there, that would not parallelize anything.
I think your analysis of the specific use cases you bring up are spot on. These use cases still fall into a rather inclusive scope of what you can do with hadoop and there are certainly other things that hadoop just wasn't designed to handle. If I had to solve the same problem, I would follow your first approach unless I knew the data was too big, then I'd follow your two-step approach.

Resources