Async doesn't work for long vectors - performance

I am doing some parallel programming with async. I have an integrator and in a test program I wanted to see whether if dividing a vector in 4 subvectors actually takes one fourth of the time to complete the task.
I had an initial issue about the time measured, now solved as steady_clock() measures real and not CPU time.
I tried the code with different vector lenghts. For short lenghts (<10e5 elements) the direct integration is faster: normal, as the .get() calls and the sum take their time.
For intermediate lenghts (about 1e8 elements) the integration followed the expected time, giving 1 s as the first time and 0.26 s for the second time.
For long vectors(10e9 or higher) the second integration takes much more time than the first, more than 3 s against a similar or greater time.
Why? What is the process that makes the divide and conquer routine slower?
A couple of additional notes: Please note that I pass the vectors by reference, so that cannot be the issue, and keep in mind that this is a test code, thus the subvector creation is not the point of the question.
#include<iostream>
#include<vector>
#include<thread>
#include<future>
#include<ctime>
#include<chrono>
using namespace std;
using namespace chrono;
typedef steady_clock::time_point tt;
double integral(const std::vector<double>& v, double dx) //simpson 1/3
{
int n=v.size();
double in=0.;
if(n%2 == 1) {in+=v[n-1]*v[n-1]; n--;}
in=(v[0]*v[0])+(v[n-1]*v[n-1]);
for(int i=1; i<n/2; i++)
in+= 2.*v[2*i] + 4.*v[2*i+1];
return in*dx/3.;
}
int main()
{
double h=0.001;
vector<double> v1(100000,h); // a vector, content is not important
// subvectors
vector<double> sv1(v1.begin(), v1.begin() + v1.size()/4),
sv2(v1.begin() + v1.size()/4 +1,v1.begin()+ 2*v1.size()/4),
sv3( v1.begin() + 2*v1.size()/4+1, v1.begin() + 3*v1.size()/4+1),
sv4( v1.begin() + 3*v1.size()/4+1, v1.end());
double a,b;
cout << "f1" << endl;
tt bt1 = chrono::steady_clock::now();
// complete integration: should take time t
a=integral(v1, h);
tt et1 = chrono::steady_clock::now();
duration<double> time_span = duration_cast<duration<double>>(et1 - bt1);
cout << time_span.count() << endl;
future<double> f1, f2,f3,f4;
cout << "f2" << endl;
tt bt2 = chrono::steady_clock::now();
// four integrations: should take time t/4
f1 = async(launch::async, integral, ref(sv1), h);
f2 = async(launch::async, integral, ref(sv2), h);
f3 = async(launch::async, integral, ref(sv3), h);
f4 = async(launch::async, integral, ref(sv4), h);
b=f1.get()+f2.get()+f3.get()+f4.get();
tt et2 = chrono::steady_clock::now();
duration<double> time_span2 = duration_cast<duration<double>>(et2 - bt2);
cout << time_span2.count() << endl;
cout << a << " " << b << endl;
return 0;
}

Related

Construct SimplicialLLT object from provided L matrix

I performed LLT factorization of a sparse matrix (SimplicialLLT) and modified the L matrix to Lmod (EDIT: I made a copy of the matrix with some modifications). I would like to construct a new SimplicialLLT object from this modified matrix Lmod so that I can directly use it to obtain a solution of system (Lmod*Lmod')x = B. Is this somehow possible in Eigen? Here is a short code example:
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/Sparse>
using namespace Eigen;
using namespace std;
int main()
{
// Initial system A0 x0 = B
// Create dense matrix first
MatrixXd A0dense(5, 5);
A0dense << 1,0,0,0,0,0,2,-1,0,0,0,-1,2,-1,0,0,0,-1,1,0,0,0,0,0,1;
// Convert it to sparse
SparseMatrix<double> A0;
A0 = A0dense.sparseView();
// Create LLT decomposition od A0
SimplicialLLT<SparseMatrix<double>, Lower, NaturalOrdering<int>> LLTofA0(A0);
// Extract the L matrix
SparseMatrix<double> L0 = LLTofA0.matrixL();
cout << "L0 =" << endl << L0 << endl;
// RHS vector B
VectorXd B(5);
B << 0, 0, 0, 1, 1;
// Create a copy of L0 with some modifications
vector<Triplet<double>> Triplets; // Vector of triplets
for (int k = 0; k < L0.outerSize(); ++k)
{
for (SparseMatrix<double>::InnerIterator it(L0, k); it; ++it) // Iterate over nonzero entries of L0
{
if (it.col() != 2)
{
// If the column is unequal to 2, store the location and value
Triplet<double> T(it.row(), it.col(), it.value());
Triplets.push_back(T);
}
}
}
// Create modified Lmod matrix from the stored triplets
SparseMatrix<double> Lmod(L0.outerSize(), L0.outerSize());
Lmod.setFromTriplets(Triplets.begin(), Triplets.end());
cout << "Lmod =" << endl << Lmod << endl;
// Now I want to solve different system (Lmod*Lmod') x = B and use the matrix Lmod for it
}

Unexpected and large runtime variations in Eigen for matrix multiplies

I am comparing ways to perform equivalent matrix operations within Eigen, and am getting extraordinarily different runtimes, including some non-intuitive results.
I am comparing three mathematically equivalent forms of the matrix multiplication:
wx * transpose(data)
The three forms I'm comparing are:
result = wx * data.transpose() (straight multiply version)
result.noalias() = wx * data.transpose() (noalias version)
result = (data * wx.transpose()).transpose() (transposed version)
I am also testing using both Column Major and Row Major storage.
With column major storage, the transposed version is significantly faster (an order of magnitude) than both the straight multiply and the no alias version, which are both approximately equal in runtime.
With row major storage, the noalias and the transposed version are both significantly faster than the straight multiply in runtime.
I understand that Eigen uses lazy evaluation, and that the immediate results returned from an operation are often expression templates, and are not the intermediate values. I also understand that matrix * matrix operations will always produce a temporary when they are the last operation on the right hand side, to avoid aliasing issues, hence why I am attempting to speed things up through noalias().
My main questions:
Why is the transposed version always significantly faster, even (in the case of column major storage) when I explicitly state noalias so no temporaries are created?
Why does the (significant) difference in runtime only occur between the straight multiply and the noalias version when using column major storage?
The code I am using for this is below. It is being compiled using gcc 4.9.2, on a Centos 6 install, using the following command line.
g++ eigen_test.cpp -O3 -std=c++11 -o eigen_test -pthread -fopenmp -finline-functions
using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;
// using Matrix = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
int wx_rows = 8000;
int wx_cols = 1000;
int samples = 1;
// Eigen::MatrixXf matrix = Eigen::MatrixXf::Random(matrix_rows, matrix_cols);
Matrix wx = Eigen::MatrixXf::Random(wx_rows, wx_cols);
Matrix data = Eigen::MatrixXf::Random(samples, wx_cols);
Matrix result;
unsigned int iterations = 10000;
float sum = 0;
auto before = std::chrono::high_resolution_clock::now();
for (unsigned int ii = 0; ii < iterations; ++ii)
{
result = wx * data.transpose();
sum += result(result.rows() - 1, result.cols() - 1);
}
auto after = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
std::cout << "original sum: " << sum << std::endl;
std::cout << "original time (ms): " << duration << std::endl;
std::cout << std::endl;
sum = 0;
before = std::chrono::high_resolution_clock::now();
for (unsigned int ii = 0; ii < iterations; ++ii)
{
result.noalias() = wx * data.transpose();
sum += result(wx_rows - 1, samples - 1);
}
after = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
std::cout << "alias sum: " << sum << std::endl;
std::cout << "alias time (ms) : " << duration << std::endl;
std::cout << std::endl;
sum = 0;
before = std::chrono::high_resolution_clock::now();
for (unsigned int ii = 0; ii < iterations; ++ii)
{
result = (data * wx.transpose()).transpose();
sum += result(wx_rows - 1, samples - 1);
}
after = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(after - before).count();
std::cout << "new sum: " << sum << std::endl;
std::cout << "new time (ms) : " << duration << std::endl;
One half of the explanation is because, in the current version of Eigen, multi-threading is achieved by splitting the work over blocks of columns of the result (and the right-hand-side). With only 1 column, multi-threading does not take place. In the column-major case, this explain why cases 1 and 2 underperform. On the other hand, case 3 is evaluated as:
column_major_tmp.noalias() = data * wx.transpose();
result = column_major_tmp.transpose();
and since wx.transpose().cols() is huge, multi-threading is effective.
To understand the row-major case, you also need to know that internally matrix products is implemented for a column-major destination. If the destination is row-major, as in case 2, then the product is transposed, so what really happens is:
row_major_result.transpose().noalias() = data * wx.transpose();
and so we're back to case 3 but without temporary.
This is clearly a limitation of current Eigen's multi-threading implementation for highly unbalanced matrix sizes. Ideally threads should be spread on row-block and/or column-block depending on the size of the matrices at hand.
BTW, you should also compile with -march=native to let Eigen fully exploit your CPU (AVX, FMA, AVX512...).

C++ - Function is completely skipped if an internal variable exceeds ~60,000

I wrote the following for a class, but came across some strange behavior while testing it. arrayProcedure is meant to do things with an array based on the 2 "tweaks" at the top of the function (arrSize, and start). For the assignment, arrSize must be 10,000, and start, 100. Just for kicks, I decided to see what happens if I increase them, and for some reason, if arrSize exceeds around 60,000 (I haven't found the exact limit), the program immediately crashes with a stack overflow when using a debugger:
Unhandled exception at 0x008F6977 in TMA3Question1.exe: 0xC00000FD: Stack overflow (parameters: 0x00000000, 0x00A32000).
If I just run it without a debugger, I don't get any helpful errors; windows hangs for a fraction of a second, then gives me an error TMA3Question1.exe has stopped working.
I decided to play around with debugging it, but that didn't shed any light. I placed breaks above and below the call to arrayProcedure, as well as peppered inside of it. When arrSize doesn't exceed 60,000 it runs fine: It pauses before calling arrayProcedure, properly waits at all the points inside of it, then pauses on the break underneath the call.
If I raise arrSize however, the break before the call happens, but it appears as though it never even steps into arrayProcedure; it immediately gives me a stack overflow without pausing at any of the internal breakpoints.
The only thing I can think of is the resulting arrays exceeds my computer's current memory, but that doesn't seem likely for a couple reasons:
It should only use just under a megabyte:
sizeof(double) = 8 bytes
8 * 60000 = 480000 bytes per array
480000 * 2 = 960000 bytes for both arrays
As far as I know, arrays aren't immediately constructed when I function is entered; they're allocated on definition. I placed several breakpoints before the arrays are even declared, and they are never reached.
Any light that you could shed on this would be appreciated.
The code:
#include <iostream>
#include <ctime>
//CLOCKS_PER_SEC is a macro supplied by ctime
double msBetween(clock_t startTime, clock_t endTime) {
return endTime - startTime / (CLOCKS_PER_SEC * 1000.0);
}
void initArr(double arr[], int start, int length, int step) {
for (int i = 0, j = start; i < length; i++, j += step) {
arr[i] = j;
}
}
//The function we're going to inline in the next question
void helper(double a1, double a2) {
std::cout << a1 << " * " << a2 << " = " << a1 * a2 << std::endl;
}
void arrayProcedure() {
const int arrSize = 70000;
const int start = 1000000;
std::cout << "Checking..." << std::endl;
if (arrSize > INT_MAX) {
std::cout << "Given arrSize is too high and exceeds the INT_MAX of: " << INT_MAX << std::endl;
return;
}
double arr1[arrSize];
double arr2[arrSize];
initArr(arr1, start, arrSize, 1);
initArr(arr2, arrSize + start - 1, arrSize, -1);
for (int i = 0; i < arrSize; i++) {
helper(arr1[i], arr2[i]);
}
}
int main(int argc, char* argv[]) {
using namespace std;
const clock_t startTime = clock();
arrayProcedure();
clock_t endTime = clock();
cout << endTime << endl;
double elapsedTime = msBetween(startTime, endTime);
cout << "\n\n" << elapsedTime << " milliseconds. ("
<< elapsedTime / 60000 << " minutes)\n";
}
The default stack size is 1 MB with Visual Studio.
https://msdn.microsoft.com/en-us/library/tdkhxaks.aspx
You can increase the stack size or use the new operator.
double *arr1 = new double[arrSize];
double *arr2 = new double[arrSize];
...
delete [] arr1;
delete [] arr2;

Is fftw output depending on size of input?

In the last week i have been programming some 2-dimensional convolutions with FFTW, by passing to the frequency domain both signals, multiplying, and then coming back.
Surprisingly, I am getting the correct result only when input size is less than a fixed number!
I am posting some working code, in which i take simple initial constant matrixes of value 2 for the input, and 1 for the filter on the spatial domain. This way, the result of convolving them should be a matrix of the average of the first matrix values, i.e., 2, since it is constant. This is the output when I vary the sizes of width and height from 0 to h=215, w=215 respectively; If I set h=216, w=216, or greater, then the output gets corrupted!! I would really appreciate some clues about where could I be making some mistake. Thank you very much!
#include <fftw3.h>
int main(int argc, char* argv[]) {
int h=215, w=215;
//Input and 1 filter are declared and initialized here
float *in = (float*) fftwf_malloc(sizeof(float)*w*h);
float *identity = (float*) fftwf_malloc(sizeof(float)*w*h);
for(int i=0;i<w*h;i++){
in[i]=5;
identity[i]=1;
}
//Declare two forward plans and one backward
fftwf_plan plan1, plan2, plan3;
//Allocate for complex output of both transforms
fftwf_complex *inTrans = (fftwf_complex*) fftw_malloc(sizeof(fftwf_complex)*h*(w/2+1));
fftwf_complex *identityTrans = (fftwf_complex*) fftw_malloc(sizeof(fftwf_complex)*h*(w/2+1));
//Initialize forward plans
plan1 = fftwf_plan_dft_r2c_2d(h, w, in, inTrans, FFTW_ESTIMATE);
plan2 = fftwf_plan_dft_r2c_2d(h, w, identity, identityTrans, FFTW_ESTIMATE);
//Execute them
fftwf_execute(plan1);
fftwf_execute(plan2);
//Multiply in frequency domain. Theoretically, no need to multiply imaginary parts; since signals are real and symmetric
//their transform are also real, identityTrans[i][i] = 0, but i leave here this for more generic implementation.
for(int i=0; i<(w/2+1)*h; i++){
inTrans[i][0] = inTrans[i][0]*identityTrans[i][0] - inTrans[i][1]*identityTrans[i][1];
inTrans[i][1] = inTrans[i][0]*identityTrans[i][1] + inTrans[i][1]*identityTrans[i][0];
}
//Execute inverse transform, store result in identity, where identity filter lied.
plan3 = fftwf_plan_dft_c2r_2d(h, w, inTrans, identity, FFTW_ESTIMATE);
fftwf_execute(plan3);
//Output first results of convolution(in, identity) to see if they are the average of in.
for(int i=0;i<h/h+4;i++){
for(int j=0;j<w/w+4;j++){
std::cout<<"After convolution, component (" << i <<","<< j << ") is " << identity[j+i*w]/(w*h*w*h) << endl;
}
}std::cout<<endl;
//Compute average of data
float sum=0.0;
for(int i=0; i<w*h;i++)
sum+=in[i];
std::cout<<"Mean of input was " << (float)sum/(w*h) << endl;
std::cout<< endl;
fftwf_destroy_plan(plan1);
fftwf_destroy_plan(plan2);
fftwf_destroy_plan(plan3);
return 0;
}
Your problem has nothing to do with fftw ! It comes from this line :
std::cout<<"After convolution, component (" << i <<","<< j << ") is " << identity[j+i*w]/(w*h*w*h) << endl;
if w=216 and h=216 then `w*h*w*h=2 176 782 336. The higher limit for signed 32bit integer is 2 147 483 647. You are facing an overflow...
Solution is to cast the denominator to float.
std::cout<<"After convolution, component (" << i <<","<< j << ") is " << identity[j+i*w]/(((float)w)*h*w*h) << endl;
The next trouble that you are going to face is this one :
float sum=0.0;
for(int i=0; i<w*h;i++)
sum+=in[i];
Remember that a float has 7 useful decimal digits. If w=h=4000, the computed average will be lower than the real one. Use a double or write two loops and sum on the inner loop (localsum) before summing the outer loop (sum+=localsum) !
Bye,
Francis

Benchmark problems for testing concurency

For one of the projects I'm doing right now, I need to look at the performance (amongst other things) of different concurrent enabled programming languages.
At the moment I'm looking into comparing stackless python and C++ PThreads, so the focus is on these two languages, but other languages will probably be tested later. Ofcourse the comparison must be as representative and accurate as possible, so my first thought was to start looking for some standard concurrent/multi-threaded benchmark problems, alas I couldn't find any decent or standard, tests/problems/benchmarks.
So my question is as follows: Do you have a suggestion for a good, easy or quick problem to test the performance of the programming language (and to expose it's strong and weak points in the process)?
Surely you should be testing hardware and compilers rather than a language for concurrency performance?
I would be looking at a language from the point of view of how easy and productive it is in terms of concurrency and how much it 'insulates' the programmer from making locking mistakes.
EDIT: from past experience as a researcher designing parallel algorithms, I think you will find in most cases the concurrent performance will depend largely on how an algorithm is parallelised, and how it targets the underlying hardware.
Also, benchmarks are notoriously unequal; this is even more so in a parallel environment. For instance, a benchmark that 'crunches' very large matrices would be suited to a vector pipeline processor, whereas a parallel sort might be better suited to more general purpose multi core CPUs.
These might be useful:
Parallel Benchmarks
NAS Parallel Benchmarks
Well, there are a few classics, but different tests emphasize different features. Some distributed systems may be more robust, have more efficient message-passing, etc. Higher message overhead can hurt scalability, since it the normal way to scale up to more machines is to send a larger number of small messages. Some classic problems you can try are a distributed Sieve of Eratosthenes or a poorly implemented fibonacci sequence calculator (i.e. to calculate the 8th number in the series, spin of a machine for the 7th, and another for the 6th). Pretty much any divide-and-conquer algorithm can be done concurrently. You could also do a concurrent implementation of Conway's game of life or heat transfer. Note that all of these algorithms have different focuses and thus you probably will not get one distributed system doing the best in all of them.
I'd say the easiest one to implement quickly is the poorly implemented fibonacci calculator, though it places too much emphasis on creating threads and too little on communication between those threads.
Surely you should be testing hardware
and compilers rather than a language
for concurrency performance?
No, hardware and compilers are irrelevant for my testing purposes. I'm just looking for some good problems that can test how well code, written in one language, can compete against code from another language. I'm really testing the constructs available in the specific languages to do concurrent programming. And one of the criteria is performance (measured in time).
Some of the other test criteria I'm looking for are:
how easy is it to write correct code; because as we all know concurrent programming is harder then writing single threaded programs
what is the technique used to to concurrent programming: event-driven, actor based, message parsing, ...
how much code must be written by the programmer himself and how much is done automatically for him: this can also be tested with the given benchmark problems
what's the level of abstraction and how much overhead is involved when translated back to machine code
So actually, I'm not looking for performance as the only and best parameter (which would indeed send me to the hardware and the compilers instead of the language itself), I'm actually looking from a programmers point of view to check what language is best suited for what kind of problems, what it's weaknesses and strengths are and so on...
Bare in mind that this is just a small project and the tests are therefore to be kept small as well. (rigorous testing of everything is therefore not feasible)
I have decided to use the Mandelbrot set (the escape time algorithm to be more precise) to benchmark the different languages.
It fits me quite well as the original algorithm can easily be implemented and creating the multi threaded variant from it is not that much work.
below is the code I currently have. It is still a single threaded variant, but I'll update it as soon as I'm satisfied with the result.
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
int DoThread(const double x, const double y, int maxiter) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++) {
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(int horizPixels, int vertPixels, int maxiter, std::vector<std::vector<int> >& result) {
for(int x = horizPixels; x > 0; x--) {
for(int y = vertPixels; y > 0; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x-1][y-1] = DoThread((3.0 / horizPixels) * (x - (horizPixels / 2)),(3.0 / vertPixels) * (y - (vertPixels / 2)),maxiter);
}
}
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
int horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
int vertPixels = atoi(argv[2]);
//third arg = iterations
int maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
std::vector<std::vector<int> > result(horizPixels, std::vector<int>(vertPixels,0)); //create and init 2-dimensional vector
SingleThreaded(horizPixels, vertPixels, maxiter, result);
//TODO: remove these lines
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
}
I've tested it with gcc under Linux, but I'm sure it works under other compilers/Operating Systems as well. To get it to work you have to enter some command line arguments like so:
mandelbrot 106 500 255 1
the first argument is the width (x-axis)
the second argument is the height (y-axis)
the third argument is the number of maximum iterations (the number of colors)
the last ons is the number of threads (but that one is currently not used)
on my resolution, the above example gives me a nice ASCII-art representation of a Mandelbrot set. But try it for yourself with different arguments (the first one will be the most important one, as that will be the width)
Below you can find the code I hacked together to test the multi threaded performance of pthreads. I haven't cleaned it up and no optimizations have been made; so the code is a bit raw.
the code to save the calculated mandelbrot set as a bitmap is not mine, you can find it here
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
#include "bitmap_Image.h" //for saving the mandelbrot as a bmp
#include <pthread.h>
pthread_mutex_t mutexCounter;
int sharedCounter(0);
int percent(0);
int horizPixels(0);
int vertPixels(0);
int maxiter(0);
//doesn't need to be locked
std::vector<std::vector<int> > result; //create 2 dimensional vector
void *DoThread(void *null) {
double curX,curY,xSquare,ySquare,x,y;
int i, intx, inty, counter;
counter = 0;
do {
counter++;
pthread_mutex_lock (&mutexCounter); //lock
intx = int((sharedCounter / vertPixels) + 0.5);
inty = sharedCounter % vertPixels;
sharedCounter++;
pthread_mutex_unlock (&mutexCounter); //unlock
//exit thread when finished
if (intx >= horizPixels) {
std::cout << "exited thread - I did " << counter << " calculations" << std::endl;
pthread_exit((void*) 0);
}
//set x and y to the correct value now -> in the range like singlethread
x = (3.0 / horizPixels) * (intx - (horizPixels / 1.5));
y = (3.0 / vertPixels) * (inty - (vertPixels / 2));
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
result[intx][inty] = i;
} while (true);
}
int DoSingleThread(const double x, const double y) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(std::vector<std::vector<int> >& result) {
for(int x = horizPixels - 1; x != -1; x--) {
for(int y = vertPixels - 1; y != -1; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x][y] = DoSingleThread((3.0 / horizPixels) * (x - (horizPixels / 1.5)),(3.0 / vertPixels) * (y - (vertPixels / 2)));
}
}
}
void MultiThreaded(int threadCount, std::vector<std::vector<int> >& result) {
/* Initialize and set thread detached attribute */
pthread_t thread[threadCount];
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (int i = 0; i < threadCount - 1; i++) {
pthread_create(&thread[i], &attr, DoThread, NULL);
}
std::cout << "all threads created" << std::endl;
for(int i = 0; i < threadCount - 1; i++) {
pthread_join(thread[i], NULL);
}
std::cout << "all threads joined" << std::endl;
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
vertPixels = atoi(argv[2]);
//third arg = iterations
maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
result = std::vector<std::vector<int> >(horizPixels, std::vector<int>(vertPixels,21)); // init 2-dimensional vector
if (threadCount <= 1) {
SingleThreaded(result);
} else {
MultiThreaded(threadCount, result);
}
//TODO: remove these lines
bitmapImage image(horizPixels, vertPixels);
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
image.setPixelRGB(x,y,16777216*result[x][y]/maxiter % 256, 65536*result[x][y]/maxiter % 256, 256*result[x][y]/maxiter % 256);
//std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
image.saveToBitmapFile("~/Desktop/test.bmp",32);
}
good results can be obtained using the program with the following arguments:
mandelbrot 5120 3840 256 3
that way you will get an image that is 5 * 1024 wide; 5 * 768 high with 256 colors (alas you will only get 1 or 2) and 3 threads (1 main thread that doesn't do any work except creating the worker threads, and 2 worker threads)
Since the benchmarks game moved to a quad-core machine September 2008, many programs in different programming languages have been re-written to exploit quad-core - for example, the first 10 mandelbrot programs.

Resources