Fast Modulo 511 and 127 - performance

Is there a way to make modulo by 511 (and 127) faster than using the "%" operator?
int c = 758 % 511;
int d = 423 % 127;

Here is a way to do fast modulo by 511, assuming that x is at most 32767. It's about twice as fast as x % 511. It does the modulo in five operations: two multiplications, one addition, one subtraction, and one shift.
inline int fast_mod_511(int x) {
    int y = (513*x + 64) >> 18;
    return x - 511*y;
}
Here is the theory behind how I arrived at this (the code I used to test it is posted at the end).
Let's consider
y = x/511 = x/(512-1) = x/512 * 1/(1-1/512).
Let's define z = 512, then
y = x/z*1/(1-1/z).
Using Taylor expansion
y = x/z(1 + 1/z + 1/z^2 + 1/z^3 + ...).
Now if we know that x has a limited range, we can truncate the expansion. Let's assume x is always less than 2^15 = 32768. Keeping only the first two terms (for x < 2^15 and z = 2^9 the dropped terms add up to less than 0.00025) and multiplying through by z^2, we can write
512*512*y = (1+512)*x = 513*x.
Adding a rounding constant and keeping only the significant bits, we arrive at
y = (513*x+64)>>18 //512^2 = 2^18.
We can compute x/511 (assuming x is less than 32768) in three steps:
multiply,
add,
shift.
Here is the code I used to profile this in MSVC2013 64-bit release mode on an Ivy Bridge core.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

inline int fast_mod_511(int x) {
    int y = (513*x + 64) >> 18;
    return x - 511*y;
}

int main() {
    unsigned int i;
    volatile unsigned int r;
    double dtime;

    dtime = omp_get_wtime();
    for(i = 0; i < 100000; i++) {
        for(int j = 0; j < 32768; j++) {
            r = j % 511;
        }
    }
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);

    dtime = omp_get_wtime();
    for(i = 0; i < 100000; i++) {
        for(int j = 0; j < 32768; j++) {
            r = fast_mod_511(j);
        }
    }
    dtime = omp_get_wtime() - dtime;
    printf("time %f\n", dtime);
}
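The question also asks about 127. Applying the same derivation with z = 128 (so 127 = z - 1 and z^2 = 2^14) gives the sketch below; note that the valid input range is much smaller here, roughly x below 2^13, and I have not profiled this variant:
inline int fast_mod_127(int x) {
    // same trick: 129 = 128 + 1, shift by 14 because 128*128 = 2^14
    // valid only for small x (roughly 0 <= x < 8192)
    int y = (129*x + 64) >> 14;
    return x - 127*y;
}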

You can use a lookup table with the solutions pre-stored. If you create an array of a million integers, looking up is about twice as fast as actually doing the modulo in my C# app.
// fill an array
var mod511 = new int[1000000];
for (int x = 0; x < 1000000; x++) mod511[x] = x % 511;
and instead of using
c = 758 % 511;
you use
c = mod511[758];
This will cost you (possibly a lot of) memory, and will obviously not work if you want to use it for very large numbers also. But it is faster.

If you have to repeat those two modulo operations on a large amount of data and your CPU supports SIMD (for example Intel's SSE/AVX/AVX2), then you can vectorize the operations, i.e., do the operations on many data elements in parallel. You can do this by using intrinsics or inline assembly. Yes, the solution will be platform specific, but maybe that is fine.
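For example, a minimal AVX2 sketch of the fast_mod_511 trick from above, processing eight ints at once (untested, and it assumes the same 0..32767 input range):
#include <immintrin.h>

static inline __m256i fast_mod_511_avx2(__m256i x) {
    // y = (513*x + 64) >> 18;  result = x - 511*y, applied to 8 lanes at once
    const __m256i c513 = _mm256_set1_epi32(513);
    const __m256i c64  = _mm256_set1_epi32(64);
    const __m256i c511 = _mm256_set1_epi32(511);
    __m256i y = _mm256_srli_epi32(_mm256_add_epi32(_mm256_mullo_epi32(x, c513), c64), 18);
    return _mm256_sub_epi32(x, _mm256_mullo_epi32(y, c511));
}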

Related

Matrix multiplication via std::vector is 10 times slower than numpy

Although it is known that using nested std::vector to represent matrices is a bad idea, let's use it for now since it is flexible and many existing functions can handle std::vector.
I thought that, for small cases, the speed difference could be ignored. But it turned out that vector<vector<double>> is 10+ times slower than numpy.dot().
Let A and B be matrices whose size is size x size. Assuming square matrices is just for simplicity. (We don't intend to limit the discussion to the square-matrix case.) We initialize each matrix in a deterministic way, and finally calculate C = A * B.
We define "calculation time" as the time elapsed just to calculate C = A * B. In other words, various overheads are not included.
Python3 code
import numpy as np
import time
import sys
if (len(sys.argv) != 2):
    print("Pass `size` as an argument.", file = sys.stderr);
    sys.exit(1);
size = int(sys.argv[1]);

A = np.ndarray((size, size));
B = np.ndarray((size, size));
for i in range(size):
    for j in range(size):
        A[i][j] = i * 3.14 + j
        B[i][j] = i * 3.14 - j

start = time.time()
C = np.dot(A, B);
print("{:.3e}".format(time.time() - start), file = sys.stderr);
C++ code
#include <cstdlib>
#include <iostream>
#include <vector>
#include <chrono>
using namespace std;

int main(int argc, char **argv) {
    if (argc != 2) {
        cerr << "Pass `size` as an argument.\n";
        return 1;
    }
    const unsigned size = atoi(argv[1]);

    vector<vector<double>> A(size, vector<double>(size));
    vector<vector<double>> B(size, vector<double>(size));
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            A[i][j] = i * 3.14 + j;
            B[i][j] = i * 3.14 - j;
        }
    }

    auto start = chrono::system_clock::now();

    vector<vector<double>> C(size, vector<double>(size, /* initial_value = */ 0));
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            for (int k = 0; k < size; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    cerr << scientific;
    cerr.precision(3);
    cerr << chrono::duration<double>(chrono::system_clock::now() - start).count() << "\n";
}
C++ code (multithreaded)
We also wrote a multithreaded version of C++ code since numpy.dot() is automatically calculated in parallel.
You can get all the codes from GitHub.
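For reference, here is a hedged sketch of what such a multithreaded driver might look like. This is not the GitHub code; it assumes a worker f(A, B, C, row_start, row_end) with the signature quoted further below, and simply splits the rows of C evenly across threads:
#include <functional>
#include <thread>
#include <vector>
using std::vector;

// worker with the same signature as the functions quoted below
void f(const vector<vector<double>> &A, const vector<vector<double>> &B,
       vector<vector<double>> &C, unsigned row_start, unsigned row_end);

// C must already be sized to match A and B
void multiply_threads(const vector<vector<double>> &A, const vector<vector<double>> &B,
                      vector<vector<double>> &C, unsigned num_threads) {
    vector<std::thread> workers;
    const unsigned rows = C.size();
    for (unsigned t = 0; t < num_threads; ++t) {
        const unsigned row_start = rows * t / num_threads;
        const unsigned row_end   = rows * (t + 1) / num_threads;
        workers.emplace_back(f, std::cref(A), std::cref(B), std::ref(C), row_start, row_end);
    }
    // each thread writes a disjoint range of rows of C, so no locking is needed
    for (auto &w : workers) w.join();
}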
Result
The C++ version is 10+ times slower than the Python 3 (with numpy) version.
matrix_size: 200x200
--------------- Time in seconds ---------------
C++ (not multithreaded): 8.45e-03
C++ (1 thread): 8.66e-03
C++ (2 threads): 4.68e-03
C++ (3 threads): 3.14e-03
C++ (4 threads): 2.43e-03
Python 3: 4.07e-04
-----------------------------------------------
matrix_size: 400x400
--------------- Time in seconds ---------------
C++ (not multithreaded): 7.011e-02
C++ (1 thread): 6.985e-02
C++ (2 threads): 3.647e-02
C++ (3 threads): 2.462e-02
C++ (4 threads): 1.915e-02
Python 3: 1.466e-03
-----------------------------------------------
Question
Is there any way to make the C++ implementation faster?
Optimizations I Tried
1. Swap the calculation order -> at most 3.5 times faster (than the original C++ code, not than the numpy code); see the sketch after this list.
2. Optimization 1 plus partial unrolling -> at most 4.5 times faster. (This was originally thought to work only when size is known in advance but, as pointed out in a comment, size does not need to be known: we can just limit the maximum value of the loop variables of the unrolled loops and process the remaining elements with normal loops. See my implementation for an example.)
3. Optimization 2, plus minimizing the calls to C[i][j] by introducing a simple variable sum -> at most 5.2 times faster. The implementation is here. This result implies that std::vector::operator[] is non-negligibly slow.
4. Optimization 3, plus the g++ -march=native flag -> at most 6.2 times faster. (By the way, we use -O3, of course.)
5. Optimization 3, plus reducing the calls to operator[] by introducing a pointer to an element of A, since A's elements are accessed sequentially in the unrolled loop -> at most 6.2 times faster, and a tiny bit faster than Optimization 4. The code is shown below.
6. The g++ -funroll-loops flag to unroll for loops -> no change
7. g++ #pragma GCC unroll n -> no change
8. The g++ -flto flag to turn on link-time optimization -> no change
9. Block algorithm -> no change
10. Transposing B to avoid cache misses -> no change
11. A long linear std::vector instead of nested std::vector<std::vector>, swapped calculation order, block algorithm, and partial unrolling -> at most 2.2 times faster
12. Optimization 1, plus PGO (profile-guided optimization) -> 4.7 times faster
13. Optimization 3, plus PGO -> same as Optimization 3
14. Optimization 3, plus the g++-specific __builtin_prefetch() -> same as Optimization 3
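For reference, this is roughly what "swap the calculation order" (Optimization 1) means: using the i-k-j loop order, so that both B[k][j] and C[i][j] are walked sequentially along a row. A minimal sketch (hedged, not the exact GitHub code):
for (int i = 0; i < size; ++i) {
    for (int k = 0; k < size; ++k) {
        const double a = A[i][k];        // A[i][k] is loaded once per inner loop
        for (int j = 0; j < size; ++j) {
            C[i][j] += a * B[k][j];      // row-wise, cache-friendly accesses
        }
    }
}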
Current Status
(originally) 13.06 times slower -> (currently) 2.10 times slower
Again, you can get all the code on GitHub. But let us quote some of it here; all of the following are functions called from the multithreaded version of the C++ code.
Original Code (GitHub)
void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    const unsigned j_max = B[0].size();
    const unsigned k_max = B.size();
    for (int i = row_start; i < row_end; ++i) {
        for (int j = 0; j < j_max; ++j) {
            for (int k = 0; k < k_max; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
Current Best Code (GitHub)
This is the implementation of Optimization 5 above.
void f(const vector<vector<double>> &A, const vector<vector<double>> &B, vector<vector<double>> &C, unsigned row_start, unsigned row_end) {
    static const unsigned num_unroll = 5;
    const unsigned j_max = B[0].size();
    const unsigned k_max_for_unrolled_loop = B.size() / num_unroll * num_unroll;
    const unsigned k_max = B.size();
    for (int i = row_start; i < row_end; ++i) {
        for (int k = 0; k < k_max_for_unrolled_loop; k += num_unroll) {
            for (int j = 0; j < j_max; ++j) {
                const double *p = A[i].data() + k;
                double sum;
                sum  = *p++ * B[k][j];
                sum += *p++ * B[k+1][j];
                sum += *p++ * B[k+2][j];
                sum += *p++ * B[k+3][j];
                sum += *p++ * B[k+4][j];
                C[i][j] += sum;
            }
        }
        for (int k = k_max_for_unrolled_loop; k < k_max; ++k) {
            const double a = A[i][k];
            for (int j = 0; j < j_max; ++j) {
                C[i][j] += a * B[k][j];
            }
        }
    }
}
We've tried many optimizations since we first posted this question. We spent two whole days struggling with this problem and have finally reached the point where we have no more ideas for optimizing the current best code. We doubt that more complex algorithms like Strassen's will do better, since the cases we handle are not large and, as we've seen, each operation on std::vector is so expensive that just reducing the calls to operator[] improved the performance considerably.
We (want to) believe we can make it better, though.
Matrix multiplication is relatively easy to optimize. However, if you want to get to decent CPU utilization it becomes tricky, because you need deep knowledge of the hardware you are using. The steps to implement a fast matmul kernel are the following:
Use SIMD instructions
Use register blocking and fetch multiple data at once
Optimize for your cache lines (mainly L2 and L3)
Parallelize your code to use multiple threads
The link below is a very good resource that explains all the nasty details:
https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
A rough sketch of the first two steps is given below. If you want more in-depth advice, leave a comment.
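As a hedged illustration of those first two steps only (SIMD plus keeping a small tile of C in registers), here is a minimal AVX/FMA sketch for row-major contiguous arrays with n divisible by 4; a real kernel, as the resource above explains, blocks far more aggressively for registers and caches:
#include <immintrin.h>

void matmul_simd(const double *A, const double *B, double *C, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; j += 4) {
            __m256d acc = _mm256_setzero_pd();             // C[i][j..j+3] stays in one register
            for (int k = 0; k < n; ++k) {
                __m256d a = _mm256_set1_pd(A[i*n + k]);    // broadcast A[i][k]
                __m256d b = _mm256_loadu_pd(&B[k*n + j]);  // four consecutive elements of row k of B
                acc = _mm256_fmadd_pd(a, b, acc);          // fused multiply-add
            }
            _mm256_storeu_pd(&C[i*n + j], acc);
        }
    }
}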

Why is local memory in this OpenCL algorithm so slow?

I am writing some OpenCL code. My kernel should create a special "accumulator" output based on an input image. I have tried two concepts and both are equally slow, although the second one uses local memory. Could you please help me identify why the local memory version is so slow? The target GPU for the kernels is an AMD Radeon Pro 450.
// version one
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
    const unsigned int x = get_global_id(0);
    const unsigned int y = get_global_id(1);

    int ind;
    int k;
    for(k = SOME_BEGINNING; k <= SOME_END; k++) {
        // some pretty wild calculation
        // ind is not linear and accesses different areas of the output
        ind = ...
        if(input[y * WIDTH + x] == 255) {
            atomic_inc(&output[ind]);
        }
    }
}
// variant two
__kernel void find_points(__global const unsigned char* input, __global unsigned int* output) {
    const unsigned int x = get_global_id(0);
    const unsigned int y = get_global_id(1);

    __local int buf[7072];
    if(y < 221 && x < 32) {
        buf[y * 32 + x] = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    int ind;
    int k;
    for(k = SOME_BEGINNING; k <= SOME_END; k++) {
        // some pretty wild calculation
        // ind is not linear and accesses different areas of the output
        ind = ...
        if(input[y * WIDTH + x] == 255) {
            atomic_inc(&buf[ind]);
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);
    if(get_local_id(0) == get_local_size(0) - 1)
        for(k = 0; k < 7072; k++)
            output[k] = buf[k];
}
I would expect that the second variant is faster than the first one, but it isn't. Sometimes it is even slower.
The local buffer size __local int buf[7072] (28288 bytes) is too big. I don't know how big the local (shared) memory of an AMD Radeon Pro 450 is, but it is likely 32 kB or 64 kB per compute unit.
32768/28288 = 1 and 65536/28288 = 2, which means only one or at most two wavefronts (of 64 work items each) can run simultaneously, so the occupancy of the compute unit is very low, hence the poor performance.
Your aim should be to reduce the local buffer as much as possible so that more wavefronts can be processed simultaneously.
Use CodeXL to profile your kernel; it has tools that show you all of this.
Alternatively, if you don't want to run the profiler, you can have a look at the CUDA occupancy calculator Excel spreadsheet to get a better idea of what this is about.
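As a side note, you can query both numbers at run time instead of guessing. A minimal host-side sketch (error handling omitted; device and kernel are assumed to be your existing cl_device_id and cl_kernel):
// needs <CL/cl.h> (or <OpenCL/opencl.h> on macOS) and <stdio.h>
cl_ulong device_local_mem = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(device_local_mem), &device_local_mem, NULL);

cl_ulong kernel_local_mem = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(kernel_local_mem), &kernel_local_mem, NULL);

printf("kernel uses %llu of %llu bytes of local memory per compute unit\n",
       (unsigned long long)kernel_local_mem, (unsigned long long)device_local_mem);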

Eigen JacobiSVD cuda compile error

I've got an error regarding calling JacobiSVD in my CUDA kernel function.
This is the part of the code that is causing the error.
Eigen::JacobiSVD<Eigen::Matrix3d> svd( cov_e, Eigen::ComputeThinU | Eigen::ComputeThinV);
And this is the error message.
CUDA_voxel_building.cu(43): error: calling a __host__
function("Eigen::JacobiSVD , (int)2> ::JacobiSVD") from a __global__
function("kernel") is not allowed
I've used the following command to compile it.
nvcc -std=c++11 -D_MWAITXINTRIN_H_INCLUDED -D__STRICT_ANSI__ -ptx CUDA_voxel_building.cu
I'm using CUDA 8.0 with Eigen 3 on Ubuntu 16.04.
It seems like other functions, such as eigenvalue decomposition, also give the same error.
Does anyone know a solution? I'm enclosing my code below.
//nvcc -ptx CUDA_voxel_building.cu
#include </usr/include/eigen3/Eigen/Core>
#include </usr/include/eigen3/Eigen/SVD>
/*
#include </usr/include/eigen3/Eigen/Sparse>
#include </usr/include/eigen3/Eigen/Dense>
#include </usr/include/eigen3/Eigen/Eigenvalues>
*/
__global__ void kernel(double *p, double *breaks,double *ind, double *mu, double *cov, double *e,double *v, int *n, char *isgood, int minpts, int maxgpu){
bool debuginfo = false;
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(debuginfo)printf("Thread %d got pointer\n",idx);
if( idx < maxgpu){
int s_ind = breaks[idx];
int e_ind = breaks[idx+1];
int diff = e_ind-s_ind;
if(diff >minpts){
int cnt = 0;
Eigen::MatrixXd local_p(3,diff) ;
for(int k = s_ind;k<e_ind;k++){
int temp_ind=ind[k];
//Eigen::Matrix<double, 3, diff> local_p;
local_p(1,cnt) = p[temp_ind*3];
local_p(2,cnt) = p[temp_ind*3+1];
local_p(3,cnt) = p[temp_ind*3+2];
cnt++;
}
Eigen::Matrix3d centered = local_p.rowwise() - local_p.colwise().mean();
Eigen::Matrix3d cov_e = (centered.adjoint() * centered) / double(local_p.rows() - 1);
Eigen::JacobiSVD<Eigen::Matrix3d> svd( cov_e, Eigen::ComputeThinU | Eigen::ComputeThinV);
/* Eigen::Matrix3d Cp = svd.matrixU() * svd.singularValues().asDiagonal() * svd.matrixV().transpose();
mu[idx]=p[ind[s_ind]*3];
mu[idx+1]=p[ind[s_ind+1]*3];
mu[idx+2]=p[ind[s_ind+2]*3];
e[idx]=svd.singularValues()(0);
e[idx+1]=svd.singularValues()(1);
e[idx+2]=svd.singularValues()(2);
n[idx] = diff;
isgood[idx] = 1;
for(int x = 0; x < 3; x++)
{
for(int y = 0; y < 3; y++)
{
v[x+ 3*y +idx*9]=svd.matrixV()(x, y);
cov[x+ 3*y +idx*9]=cov_e(x, y);
//if(debuginfo)printf("%f ",R[x+ 3*y +i*9]);
if(debuginfo)printf("%f ",Rm(x, y));
}
}
*/
} else {
mu[idx]=0;
mu[idx+1]=0;
mu[idx+2]=0;
e[idx]=0;
e[idx+1]=0;
e[idx+2]=0;
n[idx] = 0;
isgood[idx] = 0;
for(int x = 0; x < 3; x++)
{
for(int y = 0; y < 3; y++)
{
v[x+ 3*y +idx*9]=0;
cov[x+ 3*y +idx*9]=0;
}
}
}
}
}
First of all, Ubuntu 16.04 provides Eigen 3.3-beta1, which is not really recommended to be used. I would suggest upgrading to a more recent version. Furthermore, to include Eigen, write (e.g.):
#include <Eigen/Eigenvalues>
and compile with -I /usr/include/eigen3 (if you use the version provided by the OS), or better -I /path/to/local/eigen-version.
Then, as talonmies noted, you can't call host functions from kernels (I'm not sure at the moment why JacobiSVD is not marked as a device function), but in your case it would make much more sense to use Eigen::SelfAdjointEigenSolver anyway. Since the matrix you are decomposing is a fixed-size 3x3, you should actually use the optimized computeDirect method:
Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> eig; // default constructor
eig.computeDirect(cov_e); // works for 2x2 and 3x3 matrices, does not require loops
It seems that computeDirect even works with the beta version provided by Ubuntu (I'd still recommend updating).
Some unrelated notes:
The following is wrong, since you should start with index 0:
local_p(1,cnt) = p[temp_ind*3];
local_p(2,cnt) = p[temp_ind*3+1];
local_p(3,cnt) = p[temp_ind*3+2];
Also, you can write this in one line:
local_p.col(cnt) = Eigen::Vector3d::Map(p+temp_ind*3);
This line will not fit (unless diff==3):
Eigen::Matrix3d centered = local_p.rowwise() - local_p.colwise().mean();
What you probably mean is (local_p is actually 3xn not nx3)
Eigen::Matrix<double, 3, Eigen::Dynamic> centered = local_p.colwise() - local_p.rowwise().mean();
And when computing cov_e you need to .adjoint() the second factor, not the first.
You can avoid both 'big' matrices local_p and centered by directly accumulating Eigen::Matrix3d sum2 and Eigen::Vector3d sum with sum2 += v*v.adjoint() and sum += v, and then computing
Eigen::Vector3d mu = sum / diff;
Eigen::Matrix3d cov_e = (sum2 - mu*mu.adjoint()*diff)/(diff-1);
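Putting those suggestions together, a rough, untested sketch of the per-thread body (using the same p, ind, s_ind, e_ind and diff as in the original kernel):
Eigen::Vector3d sum  = Eigen::Vector3d::Zero();
Eigen::Matrix3d sum2 = Eigen::Matrix3d::Zero();
for (int k = s_ind; k < e_ind; ++k) {
    const int temp_ind = ind[k];
    const Eigen::Vector3d v = Eigen::Vector3d::Map(p + temp_ind * 3);
    sum  += v;                  // accumulate the sum of points
    sum2 += v * v.adjoint();    // accumulate the second moment
}
const Eigen::Vector3d mu    = sum / diff;
const Eigen::Matrix3d cov_e = (sum2 - mu * mu.adjoint() * diff) / (diff - 1);

Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> eig;
eig.computeDirect(cov_e);       // eigenvalues in eig.eigenvalues(), eigenvectors in eig.eigenvectors()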

Examples of strict aliasing of pointers in GCC C99, no performance differences

I'm trying to understand the impact of strict aliasing on performance in C99. My goal is to optimize a vector dot product, which takes up a large amount of time in my program (profiled it!). I thought that aliasing could be the problem, but the following code doesn't show any substantial difference between the standard approach and the strict aliasing version, even with vectors of size 100 million. I've also tried to use local variables to avoid aliasing, with similar results.
What's happening?
I'm using gcc-4.7 on OSX 10.7.4. Results are in microseconds.
$ /usr/local/bin/gcc-4.7 -fstrict-aliasing -Wall -std=c99 -O3 -o restrict restrict.c
$ ./restrict
sum: 100000000 69542
sum2: 100000000 70432
sum3: 100000000 70372
sum4: 100000000 69891
$ /usr/local/bin/gcc-4.7 -Wall -std=c99 -O0 -fno-strict-aliasing -o restrict restrict.c
$ ./restrict
sum: 100000000 258487
sum2: 100000000 261349
sum3: 100000000 258829
sum4: 100000000 258129
restrict.c (note this code will need several hundred MB RAM):
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
/* original */
long sum(int *x, int *y, int n)
{
long i, s = 0;
for(i = 0 ; i < n ; i++)
s += x[i] * y[i];
return s;
}
/* restrict */
long sum2(int *restrict x, int *restrict y, int n)
{
long i, s = 0;
for(i = 0 ; i < n ; i++)
s += x[i] * y[i];
return s;
}
/* local restrict */
long sum3(int *x, int *y, int n)
{
int *restrict xr = x;
int *restrict yr = y;
long i, s = 0;
for(i = 0 ; i < n ; i++)
s += xr[i] * yr[i];
return s;
}
/* use local variables */
long sum4(int *x, int *y, int n)
{
int xr, yr;
long i, s = 0;
for(i = 0 ; i < n ; i++)
{
xr = x[i];
yr = y[i];
s += xr * yr;
}
return s;
}
int main(void)
{
struct timeval tp1, tp2;
struct timezone tzp;
long i, n = 1e8L, s;
int *x = malloc(sizeof(int) * n);
int *y = malloc(sizeof(int) * n);
long elapsed1;
for(i = 0 ; i < n ; i++)
x[i] = y[i] = 1;
gettimeofday(&tp1, &tzp);
s = sum(x, y, n);
gettimeofday(&tp2, &tzp);
elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
+ (tp2.tv_usec - tp1.tv_usec);
printf("sum:\t%ld\t%ld\n", s, elapsed1);
gettimeofday(&tp1, &tzp);
s = sum2(x, y, n);
gettimeofday(&tp2, &tzp);
elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
+ (tp2.tv_usec - tp1.tv_usec);
printf("sum2:\t%ld\t%ld\n", s, elapsed1);
gettimeofday(&tp1, &tzp);
s = sum3(x, y, n);
gettimeofday(&tp2, &tzp);
elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
+ (tp2.tv_usec - tp1.tv_usec);
printf("sum3:\t%ld\t%ld\n", s, elapsed1);
gettimeofday(&tp1, &tzp);
s = sum4(x, y, n);
gettimeofday(&tp2, &tzp);
elapsed1 = (tp2.tv_sec - tp1.tv_sec) * 1e6
+ (tp2.tv_usec - tp1.tv_usec);
printf("sum4:\t%ld\t%ld\n", s, elapsed1);
return EXIT_SUCCESS;
}
Off the cuff:
with no strict aliasing rules, the compiler might simply generate optimized code that does subtly different things than intended.
It is not a given that disabling strict aliasing rules leads to faster code.
If it does, it's also not a given that the optimized code actually shows different results. This depends a lot on the actual data access patterns, and often even on the processor/cache architecture.
Regarding your example code, I'd say that aliasing is irrelevant (for emitted code, at least) since there is never any write access to the array elements inside the sumXXX functions.
(You might get slightly better performance, or the opposite, if you pass the same vector twice. There might be a boon from a hot cache and a smaller cache footprint. There might be a penalty from redundant loads putting the prefetch predictor off track. As always: use a profiler.)
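To actually see restrict pay off, the loop needs a store through one of the pointers, so that without restrict the compiler has to assume the output may alias the inputs and reload them on every iteration. A minimal sketch (not from the original post):
/* out is promised not to alias x or y; without restrict the compiler must
   assume that a store to out[i] could modify x[] or y[] */
void vec_add(double *restrict out, const double *x, const double *y, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}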

Benchmark problems for testing concurrency

For one of the projects I'm doing right now, I need to look at the performance (amongst other things) of different concurrency-enabled programming languages.
At the moment I'm looking into comparing Stackless Python and C++ pthreads, so the focus is on these two languages, but other languages will probably be tested later. Of course the comparison must be as representative and accurate as possible, so my first thought was to look for some standard concurrent/multithreaded benchmark problems, but alas I couldn't find any decent or standard tests/problems/benchmarks.
So my question is as follows: do you have a suggestion for a good, easy, or quick problem to test the performance of the programming languages (and to expose their strong and weak points in the process)?
Surely you should be testing hardware and compilers rather than a language for concurrency performance?
I would be looking at a language from the point of view of how easy and productive it is in terms of concurrency and how much it 'insulates' the programmer from making locking mistakes.
EDIT: from past experience as a researcher designing parallel algorithms, I think you will find in most cases the concurrent performance will depend largely on how an algorithm is parallelised, and how it targets the underlying hardware.
Also, benchmarks are notoriously unequal; this is even more so in a parallel environment. For instance, a benchmark that 'crunches' very large matrices would be suited to a vector pipeline processor, whereas a parallel sort might be better suited to more general purpose multi core CPUs.
These might be useful:
Parallel Benchmarks
NAS Parallel Benchmarks
Well, there are a few classics, but different tests emphasize different features. Some distributed systems may be more robust, have more efficient message passing, etc. Higher message overhead can hurt scalability, since the normal way to scale up to more machines is to send a larger number of small messages. Some classic problems you can try are a distributed Sieve of Eratosthenes or a poorly implemented Fibonacci sequence calculator (i.e., to calculate the 8th number in the series, spin off a machine for the 7th and another for the 6th). Pretty much any divide-and-conquer algorithm can be done concurrently. You could also do a concurrent implementation of Conway's Game of Life or of heat transfer. Note that all of these algorithms have different focuses, so you probably will not get one distributed system doing the best in all of them.
I'd say the easiest one to implement quickly is the poorly implemented Fibonacci calculator, though it places too much emphasis on creating threads and too little on communication between those threads; a sketch follows below.
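A minimal C++11 sketch of that deliberately bad Fibonacci benchmark, using std::async rather than raw pthreads just to show its shape (each call spawns a task for one sub-problem; the point is to stress task creation, not to be fast):
#include <future>
#include <iostream>

long fib(int n) {
    if (n < 2) return n;
    // spin off a task for fib(n-1) and compute fib(n-2) in the current thread
    std::future<long> left = std::async(std::launch::async, fib, n - 1);
    long right = fib(n - 2);
    return left.get() + right;
}

int main() {
    std::cout << fib(8) << "\n";   // keep n small: the number of threads grows exponentially
}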
"Surely you should be testing hardware and compilers rather than a language for concurrency performance?"
No, hardware and compilers are irrelevant for my testing purposes. I'm just looking for some good problems that can test how well code, written in one language, can compete against code from another language. I'm really testing the constructs available in the specific languages to do concurrent programming. And one of the criteria is performance (measured in time).
Some of the other test criteria I'm looking for are:
how easy it is to write correct code; because, as we all know, concurrent programming is harder than writing single-threaded programs
what technique is used to do concurrent programming: event-driven, actor-based, message passing, ...
how much code must be written by the programmer himself and how much is done automatically for him: this can also be tested with the given benchmark problems
what the level of abstraction is and how much overhead is involved when translating back to machine code
So actually, I'm not looking at performance as the only and best parameter (which would indeed send me to the hardware and the compilers instead of the language itself); I'm actually looking from a programmer's point of view to check which language is best suited for which kind of problems, what its weaknesses and strengths are, and so on...
Bear in mind that this is just a small project and the tests are therefore to be kept small as well. (Rigorous testing of everything is therefore not feasible.)
I have decided to use the Mandelbrot set (the escape time algorithm to be more precise) to benchmark the different languages.
It fits me quite well, as the original algorithm can easily be implemented and creating the multi-threaded variant from it is not that much work.
Below is the code I currently have. It is still a single-threaded variant, but I'll update it as soon as I'm satisfied with the result.
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
int DoThread(const double x, const double y, int maxiter) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++) {
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(int horizPixels, int vertPixels, int maxiter, std::vector<std::vector<int> >& result) {
for(int x = horizPixels; x > 0; x--) {
for(int y = vertPixels; y > 0; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x-1][y-1] = DoThread((3.0 / horizPixels) * (x - (horizPixels / 2)),(3.0 / vertPixels) * (y - (vertPixels / 2)),maxiter);
}
}
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
int horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
int vertPixels = atoi(argv[2]);
//third arg = iterations
int maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
std::vector<std::vector<int> > result(horizPixels, std::vector<int>(vertPixels,0)); //create and init 2-dimensional vector
SingleThreaded(horizPixels, vertPixels, maxiter, result);
//TODO: remove these lines
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
}
I've tested it with gcc under Linux, but I'm sure it works under other compilers/Operating Systems as well. To get it to work you have to enter some command line arguments like so:
mandelbrot 106 500 255 1
the first argument is the width (x-axis)
the second argument is the height (y-axis)
the third argument is the number of maximum iterations (the number of colors)
the last one is the number of threads (but that one is currently not used)
On my resolution, the above example gives me a nice ASCII-art representation of a Mandelbrot set. But try it for yourself with different arguments (the first one will be the most important one, as that will be the width).
Below you can find the code I hacked together to test the multi-threaded performance of pthreads. I haven't cleaned it up and no optimizations have been made, so the code is a bit raw.
The code to save the calculated Mandelbrot set as a bitmap is not mine; you can find it here.
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
#include "bitmap_Image.h" //for saving the mandelbrot as a bmp
#include <pthread.h>
pthread_mutex_t mutexCounter;
int sharedCounter(0);
int percent(0);
int horizPixels(0);
int vertPixels(0);
int maxiter(0);
//doesn't need to be locked
std::vector<std::vector<int> > result; //create 2 dimensional vector
void *DoThread(void *null) {
double curX,curY,xSquare,ySquare,x,y;
int i, intx, inty, counter;
counter = 0;
do {
counter++;
pthread_mutex_lock (&mutexCounter); //lock
intx = int((sharedCounter / vertPixels) + 0.5);
inty = sharedCounter % vertPixels;
sharedCounter++;
pthread_mutex_unlock (&mutexCounter); //unlock
//exit thread when finished
if (intx >= horizPixels) {
std::cout << "exited thread - I did " << counter << " calculations" << std::endl;
pthread_exit((void*) 0);
}
//set x and y to the correct value now -> in the range like singlethread
x = (3.0 / horizPixels) * (intx - (horizPixels / 1.5));
y = (3.0 / vertPixels) * (inty - (vertPixels / 2));
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
result[intx][inty] = i;
} while (true);
}
int DoSingleThread(const double x, const double y) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(std::vector<std::vector<int> >& result) {
for(int x = horizPixels - 1; x != -1; x--) {
for(int y = vertPixels - 1; y != -1; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x][y] = DoSingleThread((3.0 / horizPixels) * (x - (horizPixels / 1.5)),(3.0 / vertPixels) * (y - (vertPixels / 2)));
}
}
}
void MultiThreaded(int threadCount, std::vector<std::vector<int> >& result) {
/* Initialize and set thread detached attribute */
pthread_t thread[threadCount];
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (int i = 0; i < threadCount - 1; i++) {
pthread_create(&thread[i], &attr, DoThread, NULL);
}
std::cout << "all threads created" << std::endl;
for(int i = 0; i < threadCount - 1; i++) {
pthread_join(thread[i], NULL);
}
std::cout << "all threads joined" << std::endl;
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
vertPixels = atoi(argv[2]);
//third arg = iterations
maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
result = std::vector<std::vector<int> >(horizPixels, std::vector<int>(vertPixels,21)); // init 2-dimensional vector
if (threadCount <= 1) {
SingleThreaded(result);
} else {
MultiThreaded(threadCount, result);
}
//TODO: remove these lines
bitmapImage image(horizPixels, vertPixels);
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
image.setPixelRGB(x,y,16777216*result[x][y]/maxiter % 256, 65536*result[x][y]/maxiter % 256, 256*result[x][y]/maxiter % 256);
//std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
image.saveToBitmapFile("~/Desktop/test.bmp",32);
}
Good results can be obtained using the program with the following arguments:
mandelbrot 5120 3840 256 3
That way you will get an image that is 5 * 1024 wide and 5 * 768 high, with 256 colors (alas, you will only get 1 or 2) and 3 threads (1 main thread that doesn't do any work except creating the worker threads, and 2 worker threads).
Since the benchmarks game moved to a quad-core machine in September 2008, many programs in different programming languages have been re-written to exploit quad-core - for example, the first 10 Mandelbrot programs.
