Related
I had a question in my assignment to find whether a number was perfect square or not:
Perfect square is an element of algebraic structure which is equal to
the square of another element.
For example: 4, 9, 16 etc.
What my friends did is, if n is the number, they looped n - 1 times calculating n * n:
// just a general gist
int is_square = 0;
for (int i = 2; i < n; i++)
{
if ((i * i) == n)
{
std::cout << "Yes , it is";
is_square = 1;
break;
}
}
if (is_square == 0)
{
std::cout << "No, it is not";
}
I came up with a solution as shown below:
if (ceil(sqrt(n)) == floor(sqrt(n)))
{
std::cout << "Yes , it is";
}
else
{
std::cout << "no , it is not";
}
And it works properly.
Can it be called as more optimized solution than others?
The tried and true remains:
double sqrt(double x); // from lib
bool is_sqr(long n) {
long root = sqrt(n);
return root * root == n;
}
You would need to know the complexity of the sqrt(x) function on your computer to compare it against other methods. By the way, you are calling sqrt(n) twice ; consider storing it into a variable (even if the compiler probably does this for you).
If using something like Newton's method, the complexity of sqrt(x) is in O(M(d)) where M(d) measures the time required to multiply two d-digits numbers. Wikipedia
Your friend's method does (n - 2) multiplications (worst case scenario), thus its complexity is like O(n * M(x)) where x is a growing number.
Your version only uses sqrt() (ceil and floor can be ignored because of their constant complexity), which makes it O(M(n)).
O(M(n)) < O(n * M(x)) - Your version is more optimized than your friend's, but is not the most efficient. Have a look at interjay's link for a better approach.
#include <iostream>
using namespace std;
void isPerfect(int n){
int ctr=0,i=1;
while(n>0){
n-=i;
i+=2;
ctr++;
}
if(!n)cout<<"\nSquare root = "<<ctr;
else cout<<"\nNot a perfect square";
}
int main() {
isPerfect(3);
isPerfect(2025);
return 0;
}
I don't know what limitations do you have but perfect square number definition is clear
Another way of saying that a (non-negative) number is a square number,
is that its square roots are again integers
in wikipedia
IF SQRT(n) == FLOOR(SQRT(n)) THEN
WRITE "Yes it is"
ELSE
WRITE "No it isn't"
examples sqrt(9) == floor(sqrt(9)), sqrt(10) == floor(sqrt(10))
I'll recommend this one
if(((unsigned long long)sqrt(Number))%1==0) // sqrt() from math library
{
//Code goes Here
}
It works because.... all Integers( & only positive integers ) are positive multiples of 1
and Hence the solution.....
You can also run a benchmark Test like;
I did with following code in MSVC 2012
#include <iostream>
#include <conio.h>
#include <time.h>
#include <math.h>
using namespace std;
void IsPerfect(unsigned long long Number);
void main()
{
clock_t Initial,Final;
unsigned long long Counter=0;
unsigned long long Start=Counter;
Start--;
cout<<Start<<endl;
Initial=clock();
for( Counter=0 ; Counter<=100000000 ; Counter++ )
{
IsPerfect( Counter );
}
Final=clock();
float Time((float)Final-(float)Initial);
cout<<"Calculations Done in "<< Time ;
getch();
}
void IsPerfect( unsigned long long Number)
{
if(ceil(sqrt(Number)) == floor(sqrt(Number)))
//if(((unsigned long long)sqrt(Number))%1==0) // sqrt() from math library
{
}
}
Your code took 13590 time units
Mine just 10049 time units
Moreover I'm using few extra steps that is Type conversion
like
(unsigned long long)sqrt(Number))
Without this it can do still better
I hope it helps .....
Have a nice day....
Your solutions is more optimized, but it may not work. Since sqrt(x) may return the true square root +/- epsilon, 3 different roots must be tested:
bool isPerfect(long x){
double k = round( sqrt(x) );
return (n==(k-1)*(k-1)) || (n==k*k) || (n==(k+1)*(k+1));
}
This is simple python code for finding perfect square r not:
import math
n=(int)(input())
giv=(str)(math.sqrt(n))
if(len(giv.split('.')[1])==1):
print ("yes")
else:
print ("No")
I do not intend to use this for security purposes or statistical analysis. I need to create a simple random number generator for use in my computer graphics application. I don't want to use the term "random number generator", since people think in very strict terms about it, but I can't think of any other word to describe it.
it has to be fast.
it must be repeatable, given a particular seed.
Eg: If seed = x, then the series a,b,c,d,e,f..... should happen every time I use the seed x.
Most importantly, I need to be able to compute the nth term in the series in constant time.
It seems, that I cannot achieve this with rand_r or srand(), since these need are state dependent, and I may need to compute the nth in some unknown order.
I've looked at Linear Feedback Shift registers, but these are state dependent too.
So far I have this:
int rand = (n * prime1 + seed) % prime2
n = used to indicate the index of the term in the sequence. Eg: For
first term, n ==1
prime1 and prime2 are prime numbers where
prime1 > prime2
seed = some number which allows one to use the same function to
produce a different series depending on the seed, but the same series
for a given seed.
I can't tell how good or bad this is, since I haven't used it enough, but it would be great if people with more experience in this can point out the problems with this, or help me improve it..
EDIT - I don't care if it is predictable. I'm just trying to creating some randomness in my computer graphics.
Use a cryptographic block cipher in CTR mode. The Nth output is just encrypt(N). Not only does this give you the desired properties (O(1) computation of the Nth output); it also has strong non-predictability properties.
I stumbled on this a while back, looking for a solution for the same problem. Recently, I figured out how to do it in low-constant O(log(n)) time. While this doesn't quite match the O(1) requested by the author, It may be fast enough (a sample run, compiled with -O3, achieved performance of 1 billion arbitrary index random numbers, with n varying between 1 and 2^48, in 55.7s -- just shy of 18M numbers/s).
First, the theory behind the solution:
A common type of RNGs are Linear Congruential Generators, basically, they work as follows:
random(n) = (m*random(n-1) + b) mod p
Where m and b, and p are constants (see a reference on LCGs for how they are chosen). From this, we can devise the following using a bit of modular arithmetic:
random(0) = seed mod p
random(1) = m*seed + b mod p
random(2) = m^2*seed + m*b + b mod p
...
random(n) = m^n*seed + b*Sum_{i = 0 to n - 1} m^i mod p
= m^n*seed + b*(m^n - 1)/(m - 1) mod p
Computing the above can be a problem, since the numbers will quickly exceed numeric limits. The solution for the generic case is to compute m^n in modulo with p*(m - 1), however, if we take b = 0 (a sub-case of LCGs sometimes called Multiplicative congruential Generators), we have a much simpler solution, and can do our computations in modulo p only.
In the following, I use the constant parameters used by RANF (developed by CRAY), where p = 2^48 and g = 44485709377909. The fact that p is a power of 2 reduces the number of operations required (as expected):
#include <cassert>
#include <stdint.h>
#include <cstdlib>
class RANF{
// MCG constants and state data
static const uint64_t m = 44485709377909ULL;
static const uint64_t n = 0x0000010000000000ULL; // 2^48
static const uint64_t randMax = n - 1;
const uint64_t seed;
uint64_t state;
public:
// Constructors, which define the seed
RANF(uint64_t seed) : seed(seed), state(seed) {
assert(seed > 0 && "A seed of 0 breaks the LCG!");
}
// Gets the next random number in the sequence
inline uint64_t getNext(){
state *= m;
return state & randMax;
}
// Sets the MCG to a specific index
inline void setPosition(size_t index){
state = seed;
uint64_t mPower = m;
for (uint64_t b = 1; index; b <<= 1){
if (index & b){
state *= mPower;
index ^= b;
}
mPower *= mPower;
}
}
};
#include <cstdio>
void example(){
RANF R(1);
// Gets the number through random-access -- O(log(n))
R.setPosition(12345); // Goes to the nth random number
printf("fast nth number = %lu\n", R.getNext());
// Gets the number through standard, sequential access -- O(n)
R.setPosition(0);
for(size_t i = 0; i < 12345; i++) R.getNext();
printf("slow nth number = %lu\n", R.getNext());
}
While I presume the author has moved on by now, hopefully this will be of use to someone else.
If you're really concerned about runtime performance, the above can be made about 10x faster with lookup tables, at the cost of compilation time and binary size (it also is O(1) w.r.t the desired random index, as requested by OP)
In the version below, I used c++14 constexpr to generate the lookup tables at compile time, and got to 176M arbitrary index random numbers per second (doing this did however add about 12s of extra compilation time, and a 1.5MB increase in binary size -- the added time may be mitigated if partial recompilation is used).
class RANF{
// MCG constants and state data
static const uint64_t m = 44485709377909ULL;
static const uint64_t n = 0x0000010000000000ULL; // 2^48
static const uint64_t randMax = n - 1;
const uint64_t seed;
uint64_t state;
// Lookup table
struct lookup_t{
uint64_t v[3][65536];
constexpr lookup_t() : v() {
uint64_t mi = RANF::m;
for (size_t i = 0; i < 3; i++){
v[i][0] = 1;
uint64_t val = mi;
for (uint16_t j = 0x0001; j; j++){
v[i][j] = val;
val *= mi;
}
mi = val;
}
}
};
friend struct lookup_t;
public:
// Constructors, which define the seed
RANF(uint64_t seed) : seed(seed), state(seed) {
assert(seed > 0 && "A seed of 0 breaks the LCG!");
}
// Gets the next random number in the sequence
inline uint64_t getNext(){
state *= m;
return state & randMax;
}
// Sets the MCG to a specific index
// Note: idx.u16 indices need to be adapted for big-endian machines!
inline void setPosition(size_t index){
static constexpr auto lookup = lookup_t();
union { uint16_t u16[4]; uint64_t u64; } idx;
idx.u64 = index;
state = seed * lookup.v[0][idx.u16[0]] * lookup.v[1][idx.u16[1]] * lookup.v[2][idx.u16[2]];
}
};
Basically, what this does is splits the computation of, for example, m^0xAAAABBBBCCCC mod p, into (m^0xAAAA00000000 mod p)*(m^0xBBBB0000 mod p)*(m^CCCC mod p) mod p, and then precomputes tables for each of the values in the 0x0000 - 0xFFFF range that could fill AAAA, BBBB or CCCC.
RNG in a normal sense, have the sequence pattern like f(n) = S(f(n-1))
They also lost precision at some point (like % mod), due to computing convenience, therefore it is not possible to expand the sequence to a function like X(n) = f(n) = trivial function with n only.
This mean at best you have O(n) with that.
To target for O(1) you therefore need to abandon the idea of f(n) = S(f(n-1)), and designate a trivial formula directly so that the N'th number can be calculated directly without knowing (N-1)'th; this also render the seed meaningless.
So, you end up have a simple algebra function and not a sequence. For example:
int my_rand(int n) { return 42; } // Don't laugh!
int my_rand(int n) { 3*n*n + 2*n + 7; }
If you want to put more constraint to the generated pattern (like distribution), it become a complex maths problem.
However, for your original goal, if what you want is constant speed to get pseudo-random numbers, I suggest to pre-generate it with traditional RNG and access with lookup table.
EDIT: I noticed you have concern with a table size for a lot of numbers, however you may introduce some hybrid model, like a table of N entries, and do f(k) = g( tbl[k%n], k), which at least provide good distribution across N continue sequence.
This demonstrates an PRNG implemented as a hashed counter. This might appear to duplicate R.'s suggestion (using a block cipher in CTR mode as a stream cipher), but for this, I avoided using cryptographically secure primitives: for speed of execution and because security wasn't a desired feature.
If we were trying to create a secure stream cipher with your requirement that any emitted sequence be trivially repeatable, given knowledge of its index...
...then we could choose a secure hash algorithm (like SHA256) and a counter with a lot of bits (maybe 2048 -> sequence repeats every 2^2048 generated numbers without reseeding).
HOWEVER, the version I present here uses Bob Jenkins' famous hash function (simple and fast, but not secure) along with a 64-bit counter (which is as big as integers can get on my system, without needing custom incrementing code).
Code in main demonstrates that knowledge of the RNG's counter (seed) after initialization allows a PRNG sequence to be repeated, as long as we know how many values were generated leading up to the repetition point.
Actually, if you know the counter's value at any point in the output sequence, you will be able to retrieve all values generated previous to that point, AND all values which will be generated afterward. This only involves adding or subtracting ordinal differences to/from a reference counter value associated with a known point in the output sequence.
It should be pretty easy to adapt this class for use as a testing framework -- you could plug in other hash functions and change the counter's size to see what kind of impact there is on speed as well as the distribution of generated values (the only uniformity analysis I did was to look for patterns in the screenfuls of hexadecimal numbers printed by main()).
#include <iostream>
#include <iomanip>
#include <ctime>
using namespace std;
class CHashedCounterRng {
static unsigned JenkinsHash(const void *input, unsigned len) {
unsigned hash = 0;
for(unsigned i=0; i<len; ++i) {
hash += static_cast<const unsigned char*>(input)[i];
hash += hash << 10;
hash ^= hash >> 6;
}
hash += hash << 3;
hash ^= hash >> 11;
hash += hash << 15;
return hash;
}
unsigned long long m_counter;
void IncrementCounter() { ++m_counter; }
public:
unsigned long long GetSeed() const {
return m_counter;
}
void SetSeed(unsigned long long new_seed) {
m_counter = new_seed;
}
unsigned int operator ()() {
// the next random number is generated here
const auto r = JenkinsHash(&m_counter, sizeof(m_counter));
IncrementCounter();
return r;
}
// the default coontructor uses time()
// to seed the counter
CHashedCounterRng() : m_counter(time(0)) {}
// you can supply a predetermined seed here,
// or after construction with SetSeed(seed)
CHashedCounterRng(unsigned long long seed) : m_counter(seed) {}
};
int main() {
CHashedCounterRng rng;
// time()'s high bits change very slowly, so look at low digits
// if you want to verify that the seed is different between runs
const auto stored_counter = rng.GetSeed();
cout << "initial seed: " << stored_counter << endl;
for(int i=0; i<20; ++i) {
for(int j=0; j<8; ++j) {
const unsigned x = rng();
cout << setfill('0') << setw(8) << hex << x << ' ';
}
cout << endl;
}
cout << endl;
cout << "The last line again:" << endl;
rng.SetSeed(stored_counter + 19 * 8);
for(int j=0; j<8; ++j) {
const unsigned x = rng();
cout << setfill('0') << setw(8) << hex << x << ' ';
}
cout << endl << endl;
return 0;
}
I need some help to parallelize the pi calculation with the monte carlo method with openmp by a given random number generator, which is not thread safe.
First: This SO thread didn't help me.
My own try is the following #pragma omp statements. I thought the i, x and y vars should be init by each thread and should than be private. z ist the sum of all hits in the circle, so it should be summed after the implied barriere after the for loop.
Think the main problem ist the static state var of the random number generator. I made a critical section where the functions are called, so that only one thread per time could execute it. But the Pi solutions doesn't scale with more higher values.
Note: I should not use another RNG, but its okay to make little changes on it.
int main (int argc, char *argv[]) {
int i, z = 0, threads = 8, iters = 100000;
double x,y, pi;
#pragma omp parallel firstprivate(i,x,y) reduction(+:z) num_threads(threads)
for (i=0; i<iters; ++i) {
#pragma omp critical
{
x = rng_doub(1.0);
y = rng_doub(1.0);
}
if ((x*x+y*y) <= 1.0)
z++;
}
pi = ((double) z / (double) (iters*threads))*4.0;
printf("Pi: %lf\n", pi);;
return 0;
}
This RNG is actually an included file, but as I'm not sure if I create the header file correct, I integrated it in the other program file, so I have only one .c file.
#define RNG_MOD 741025
int rng_int(void) {
static int state = 0;
return (state = (1366 * state + 150889) % RNG_MOD);
}
double rng_doub(double range) {
return ((double) rng_int()) / (double) ((RNG_MOD - 1)/range);
}
I've also tried to make the static int state global, but it doesn't change my result, maybe I done it wrong. So please could you help me make the correct changes? Thank you very much!
Your original linear congruent PRNG has a cycle length of 49400, therefore you are only getting 29700 unique test points. This is a terrible generator to be used for any kind of Monte Carlo simulations. Even if you make 100000000 trials, you won't get any closer to the true value of Pi because you are simply repeating the same points over and over again and as a result both the final value of z and iters are simply multiplied by the same constant, which cancel in the end during the division.
The per-thread seed introduced by Z boson improves the situation a little bit with the number of unique points increasing with the total number of OpenMP threads. The increase is not linear since if the seed of one PRNG falls in the sequence of another PRNG, both PRNGs produce the same sequence shifted with no more than 49400 elements. Given the cycle length, each PRNG covers 49400/RNG_MOD = 6,7% of the total output range and that is the probability of two PRNGs being synchronised. There are a total of RNG_MOD/49400 = 15 unique sequences possible. It basically means that in the best seeding case scenario you won't be able to get past 30 threads as any other thread would simply repeat the result of some of the others. The multiplier 2 comes from the fact that each point uses two elements from the sequence and therefore it is possible to get a different set of points if you shift the sequence by one element.
The ultimate solution is to completely drop your PRNG and stick to something like Mersenne twister MT19937, which has a cycle length of 219937 − 1 and a very strong seeding algorithm. If you are not able to use another PRNG as you state in your question, at least modify the constants of the LCG to match those used in rand():
int rng_int(void) {
static int state = 1;
// & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}
Note that rand() is not a good PRNG - it is still bad. It is just a little better than the one used in your code.
Try the code below. It makes a private state for each thread. I did something similar with the at rand_r function Why does calculation with OpenMP take 100x more time than with a single thread?
Edit: I updated my code using some of Hristo's suggestions. I used threadprivate (for the first time). I also used a better rand function which gives a better estimate of pi but it's still not good enough.
One strange things was I had to define the function rng_int after threadprivate otherwise I got an error "error: 'state' declared 'threadprivate' after first use". I should probably ask a question about this.
//gcc -O3 -Wall -pedantic -fopenmp main.c
#include <omp.h>
#include <stdio.h>
#define RNG_MOD 0x80000000
int state;
int rng_int(void);
double rng_doub(double range);
int main() {
int i, numIn, n;
double x, y, pi;
n = 1<<30;
numIn = 0;
#pragma omp threadprivate(state)
#pragma omp parallel private(x, y) reduction(+:numIn)
{
state = 25234 + 17 * omp_get_thread_num();
#pragma omp for
for (i = 0; i <= n; i++) {
x = (double)rng_doub(1.0);
y = (double)rng_doub(1.0);
if (x*x + y*y <= 1) numIn++;
}
}
pi = 4.*numIn / n;
printf("asdf pi %f\n", pi);
return 0;
}
int rng_int(void) {
// & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}
double rng_doub(double range) {
return ((double)rng_int()) / (((double)RNG_MOD)/range);
}
You can see the results (and edit and run the code) at http://coliru.stacked-crooked.com/a/23c1753a1b7d1b0d
So, we see a lot of fibonacci questions. I, personally, hate them. A lot. More than a lot. I thought it'd be neat if maybe we could make it impossible for anyone to ever use it as an interview question again. Let's see how close to O(1) we can get fibonacci.
Here's my kick off, pretty much crib'd from Wikipedia, with of course plenty of headroom. Importantly, this solution will detonate for any particularly large fib, and it contains a relatively naive use of the power function, which places it at O(log(n)) at worst, if your libraries aren't good. I suspect we can get rid of the power function, or at least specialize it. Anyone up for helping? Is there a true O(1) solution, other than the finite* solution of using a look-up table?
http://ideone.com/FDt3P
#include <iostream>
#include <math.h>
using namespace std; // would never normally do this.
int main() {
int target = 10;
cin >> target;
// should be close enough for anything that won't make us explode anyway.
float mangle = 2.23607610;
float manglemore = mangle;
++manglemore; manglemore = manglemore / 2;
manglemore = pow(manglemore, target);
manglemore = manglemore/mangle;
manglemore += .5;
cout << floor(manglemore);
}
*I know, I know, it's enough for any of the zero practical uses fibonacci has.
Here is a near O(1) solution for a Fibonacci sequence term. Admittedly, O(log n) depending on the system Math.pow() implementation, but it is Fibonacci w/o a visible loop, if your interviewer is looking for that. The ceil() was due to rounding precision on larger values returning .9 repeating.
Example in JS:
function fib (n) {
var A=(1+Math.sqrt(5))/2,
B=(1-Math.sqrt(5))/2,
fib = (Math.pow(A,n) - Math.pow(B,n)) / Math.sqrt(5);
return Math.ceil(fib);
}
Given arbitrary large inputs, simply reading in n takes O(log n), so in that sense no constant time algorithm is possible. So, use the closed form solution, or precompute the values you care about, to get reasonable performance.
Edit: In comments it was pointed out that it is actually worse, because fibonacci is O(phi^n) printing the result of Fibonacci is O(log (phi^n)) which is O(n)!
The following answer executes in O(1), though I am not sure whether it is qualified for you question. It is called Template Meta-Programming.
#include <iostream>
using namespace std;
template <int N>
class Fibonacci
{
public:
enum {
value = Fibonacci<N - 1>::value + Fibonacci<N - 2>::value
};
};
template <>
class Fibonacci<0>
{
public:
enum {
value = 0
};
};
template <>
class Fibonacci<1>
{
public:
enum {
value = 1
};
};
int main()
{
cout << Fibonacci<50>::value << endl;
return 0;
}
In Programming: The Derivation of Algorithms, Anne Kaldewaij expands out the linear algebra solution to get (translated and refactored from the programming language used in that book):
template <typename Int_t> Int_t fib(Int_t n)
{
Int_t a = 0, b = 1, x = 0, y 1, t0, t1;
while (n != 0) {
switch(n % 2) {
case 1:
t0 = a * x + b * y;
t1 = b * x + a * y + b * y;
x = t0;
y = t1;
--n;
continue;
default:
t0 = a * a + b * b;
t1 = 2 * a * b + b * b;
a = t0;
b = t1;
n /= 2;
continue;
}
}
return x;
}
This has O(log n) complexity. That's not constant, of course, but I think it's worth adding to the discussion, especially given that it only uses relatively fast integer operations and has no possibility of rounding error.
Yes. Precalculate the values, and store in an array,
then use N to do a lookup.
Pick some largest value to handle. For any larger value, raise an error. For any smaller value than that, just store the answer at that smaller value, and keep running the calculation for the "largest" value, and return the stored value.
After all, O(1) specifically means "constant", not "fast". With this method, all calculations will take the same amount of time.
Fibonacci in O(1) space and time (Python implementation):
PHI = (1 + sqrt(5)) / 2
def fib(n: int):
return int(PHI ** n / sqrt(5) + 0.5)
For one of the projects I'm doing right now, I need to look at the performance (amongst other things) of different concurrent enabled programming languages.
At the moment I'm looking into comparing stackless python and C++ PThreads, so the focus is on these two languages, but other languages will probably be tested later. Ofcourse the comparison must be as representative and accurate as possible, so my first thought was to start looking for some standard concurrent/multi-threaded benchmark problems, alas I couldn't find any decent or standard, tests/problems/benchmarks.
So my question is as follows: Do you have a suggestion for a good, easy or quick problem to test the performance of the programming language (and to expose it's strong and weak points in the process)?
Surely you should be testing hardware and compilers rather than a language for concurrency performance?
I would be looking at a language from the point of view of how easy and productive it is in terms of concurrency and how much it 'insulates' the programmer from making locking mistakes.
EDIT: from past experience as a researcher designing parallel algorithms, I think you will find in most cases the concurrent performance will depend largely on how an algorithm is parallelised, and how it targets the underlying hardware.
Also, benchmarks are notoriously unequal; this is even more so in a parallel environment. For instance, a benchmark that 'crunches' very large matrices would be suited to a vector pipeline processor, whereas a parallel sort might be better suited to more general purpose multi core CPUs.
These might be useful:
Parallel Benchmarks
NAS Parallel Benchmarks
Well, there are a few classics, but different tests emphasize different features. Some distributed systems may be more robust, have more efficient message-passing, etc. Higher message overhead can hurt scalability, since it the normal way to scale up to more machines is to send a larger number of small messages. Some classic problems you can try are a distributed Sieve of Eratosthenes or a poorly implemented fibonacci sequence calculator (i.e. to calculate the 8th number in the series, spin of a machine for the 7th, and another for the 6th). Pretty much any divide-and-conquer algorithm can be done concurrently. You could also do a concurrent implementation of Conway's game of life or heat transfer. Note that all of these algorithms have different focuses and thus you probably will not get one distributed system doing the best in all of them.
I'd say the easiest one to implement quickly is the poorly implemented fibonacci calculator, though it places too much emphasis on creating threads and too little on communication between those threads.
Surely you should be testing hardware
and compilers rather than a language
for concurrency performance?
No, hardware and compilers are irrelevant for my testing purposes. I'm just looking for some good problems that can test how well code, written in one language, can compete against code from another language. I'm really testing the constructs available in the specific languages to do concurrent programming. And one of the criteria is performance (measured in time).
Some of the other test criteria I'm looking for are:
how easy is it to write correct code; because as we all know concurrent programming is harder then writing single threaded programs
what is the technique used to to concurrent programming: event-driven, actor based, message parsing, ...
how much code must be written by the programmer himself and how much is done automatically for him: this can also be tested with the given benchmark problems
what's the level of abstraction and how much overhead is involved when translated back to machine code
So actually, I'm not looking for performance as the only and best parameter (which would indeed send me to the hardware and the compilers instead of the language itself), I'm actually looking from a programmers point of view to check what language is best suited for what kind of problems, what it's weaknesses and strengths are and so on...
Bare in mind that this is just a small project and the tests are therefore to be kept small as well. (rigorous testing of everything is therefore not feasible)
I have decided to use the Mandelbrot set (the escape time algorithm to be more precise) to benchmark the different languages.
It fits me quite well as the original algorithm can easily be implemented and creating the multi threaded variant from it is not that much work.
below is the code I currently have. It is still a single threaded variant, but I'll update it as soon as I'm satisfied with the result.
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
int DoThread(const double x, const double y, int maxiter) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++) {
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(int horizPixels, int vertPixels, int maxiter, std::vector<std::vector<int> >& result) {
for(int x = horizPixels; x > 0; x--) {
for(int y = vertPixels; y > 0; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x-1][y-1] = DoThread((3.0 / horizPixels) * (x - (horizPixels / 2)),(3.0 / vertPixels) * (y - (vertPixels / 2)),maxiter);
}
}
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
int horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
int vertPixels = atoi(argv[2]);
//third arg = iterations
int maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
std::vector<std::vector<int> > result(horizPixels, std::vector<int>(vertPixels,0)); //create and init 2-dimensional vector
SingleThreaded(horizPixels, vertPixels, maxiter, result);
//TODO: remove these lines
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
}
I've tested it with gcc under Linux, but I'm sure it works under other compilers/Operating Systems as well. To get it to work you have to enter some command line arguments like so:
mandelbrot 106 500 255 1
the first argument is the width (x-axis)
the second argument is the height (y-axis)
the third argument is the number of maximum iterations (the number of colors)
the last ons is the number of threads (but that one is currently not used)
on my resolution, the above example gives me a nice ASCII-art representation of a Mandelbrot set. But try it for yourself with different arguments (the first one will be the most important one, as that will be the width)
Below you can find the code I hacked together to test the multi threaded performance of pthreads. I haven't cleaned it up and no optimizations have been made; so the code is a bit raw.
the code to save the calculated mandelbrot set as a bitmap is not mine, you can find it here
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
#include "bitmap_Image.h" //for saving the mandelbrot as a bmp
#include <pthread.h>
pthread_mutex_t mutexCounter;
int sharedCounter(0);
int percent(0);
int horizPixels(0);
int vertPixels(0);
int maxiter(0);
//doesn't need to be locked
std::vector<std::vector<int> > result; //create 2 dimensional vector
void *DoThread(void *null) {
double curX,curY,xSquare,ySquare,x,y;
int i, intx, inty, counter;
counter = 0;
do {
counter++;
pthread_mutex_lock (&mutexCounter); //lock
intx = int((sharedCounter / vertPixels) + 0.5);
inty = sharedCounter % vertPixels;
sharedCounter++;
pthread_mutex_unlock (&mutexCounter); //unlock
//exit thread when finished
if (intx >= horizPixels) {
std::cout << "exited thread - I did " << counter << " calculations" << std::endl;
pthread_exit((void*) 0);
}
//set x and y to the correct value now -> in the range like singlethread
x = (3.0 / horizPixels) * (intx - (horizPixels / 1.5));
y = (3.0 / vertPixels) * (inty - (vertPixels / 2));
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
result[intx][inty] = i;
} while (true);
}
int DoSingleThread(const double x, const double y) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(std::vector<std::vector<int> >& result) {
for(int x = horizPixels - 1; x != -1; x--) {
for(int y = vertPixels - 1; y != -1; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x][y] = DoSingleThread((3.0 / horizPixels) * (x - (horizPixels / 1.5)),(3.0 / vertPixels) * (y - (vertPixels / 2)));
}
}
}
void MultiThreaded(int threadCount, std::vector<std::vector<int> >& result) {
/* Initialize and set thread detached attribute */
pthread_t thread[threadCount];
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (int i = 0; i < threadCount - 1; i++) {
pthread_create(&thread[i], &attr, DoThread, NULL);
}
std::cout << "all threads created" << std::endl;
for(int i = 0; i < threadCount - 1; i++) {
pthread_join(thread[i], NULL);
}
std::cout << "all threads joined" << std::endl;
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
vertPixels = atoi(argv[2]);
//third arg = iterations
maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
result = std::vector<std::vector<int> >(horizPixels, std::vector<int>(vertPixels,21)); // init 2-dimensional vector
if (threadCount <= 1) {
SingleThreaded(result);
} else {
MultiThreaded(threadCount, result);
}
//TODO: remove these lines
bitmapImage image(horizPixels, vertPixels);
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
image.setPixelRGB(x,y,16777216*result[x][y]/maxiter % 256, 65536*result[x][y]/maxiter % 256, 256*result[x][y]/maxiter % 256);
//std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
image.saveToBitmapFile("~/Desktop/test.bmp",32);
}
good results can be obtained using the program with the following arguments:
mandelbrot 5120 3840 256 3
that way you will get an image that is 5 * 1024 wide; 5 * 768 high with 256 colors (alas you will only get 1 or 2) and 3 threads (1 main thread that doesn't do any work except creating the worker threads, and 2 worker threads)
Since the benchmarks game moved to a quad-core machine September 2008, many programs in different programming languages have been re-written to exploit quad-core - for example, the first 10 mandelbrot programs.