Fast block placement algorithm, advice needed? - algorithm

I need to emulate the window placement strategy of the Fluxbox window manager.
As a rough guide, visualize randomly sized windows filling up the screen one at a time, where the rough size of each results in an average of 80 windows on screen without any window overlapping another.
If you have Fluxbox and Xterm installed on your system, you can try the xwinmidiarptoy BASH script to see a rough prototype of what I want happening. See the xwinmidiarptoy.txt notes I've written about it explaining what it does and how it should be used.
It is important to note that windows will close and the space that closed windows previously occupied becomes available once more for the placement of new windows.
The algorithm needs to be an Online Algorithm processing data "piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start."
The Fluxbox window placement strategy has three binary options which I want to emulate:
Windows build horizontal rows or vertical columns (potentially)
Windows are placed from left to right or right to left
Windows are placed from top to bottom or bottom to top
Differences between the target algorithm and a window-placement algorithm
The coordinate units are not pixels. The grid within which blocks will be placed will be 128 x 128 units. Furthermore, the placement area may be further shrunk by a boundary area placed within the grid.
Why is the algorithm a problem?
It needs to operate to the deadlines of a real time thread in an audio application.
At this moment I am only concerned with getting a fast algorithm, don't concern yourself over the implications of real time threads and all the hurdles in programming that that brings.
And although the algorithm will never ever place a window which overlaps another, the user will be able to place and move certain types of blocks, overlapping windows will exist. The data structure used for storing the windows and/or free space, needs to be able to handle this overlap.
So far I have two choices which I have built loose prototypes for:
1) A port of the Fluxbox placement algorithm into my code.
The problem with this is, the client (my program) gets kicked out of the audio server (JACK) when I try placing the worst case scenario of 256 blocks using the algorithm. This algorithm performs over 14000 full (linear) scans of the list of blocks already placed when placing the 256th window.
For a demonstration of this I created a program called text_boxer-0.0.2.tar.bz2 which takes a text file as input and arranges it within ASCII boxes. Issue make to build it. A little unfriendly, use --help (or any other invalid option) for a list of command line options. You must specify the text file by using the option.
2) My alternative approach.
Only partially implemented, this approach uses a data structure for each area of rectangular free unused space (the list of windows can be entirely separate, and is not required for testing of this algorithm). The data structure acts as a node in a doubly linked list (with sorted insertion), as well as containing the coordinates of the top-left corner, and the width and height.
Furthermore, each block data structure also contains four links which connect to each immediately adjacent (touching) block on each of the four sides.
IMPORTANT RULE: Each block may only touch with one block per side. This is a rule specific to the algorithm's way of storing free unused space and bears no factor in how many actual windows may touch each other.
The problem with this approach is, it's very complex. I have implemented the straightforward cases where 1) space is removed from one corner of a block, 2) splitting neighbouring blocks so that the IMPORTANT RULE is adhered to.
The less straightforward case, where the space to be removed can only be found within a column or row of boxes, is only partially implemented - if one of the blocks to be removed is an exact fit for width (ie column) or height (ie row) then problems occur. And don't even mention the fact this only checks columns one box wide, and rows one box tall.
I've implemented this algorithm in C - the language I am using for this project (I've not used C++ for a few years and am uncomfortable using it after having focused all my attention to C development, it's a hobby). The implementation is 700+ lines of code (including plenty of blank lines, brace lines, comments etc). The implementation only works for the horizontal-rows + left-right + top-bottom placement strategy.
So I've either got to add some way of making this +700 lines of code work for the other 7 placement strategy options, or I'm going to have to duplicate those +700 lines of code for the other seven options. Neither of these is attractive, the first, because the existing code is complex enough, the second, because of bloat.
The algorithm is not even at a stage where I can use it in the real time worst case scenario, because of missing functionality, so I still don't know if it actually performs better or worse than the first approach.
The current state of C implementation of this algorithm is freespace.c. I use gcc -O0 -ggdb freespace.c to build, and run it in an xterm sized to atleast 124 x 60 chars.
What else is there?
I've skimmed over and discounted:
Bin Packing algorithms: their
emphasis on optimal fit does not
match the requirements of this
algorithm.
Recursive Bisection Placement algorithms: sounds promising, but
these are for circuit design. Their
emphasis is optimal wire length.
Both of these, especially the latter, all elements to be placed/packs are known before the algorithm begins.
What are your thoughts on this? How would you approach it? What other algorithms should I look at? Or even what concepts should I research seeing as I've not studied computer science/software engineering?
Please ask questions in comments if further information is needed.
Further ideas developed since asking this question
Some combination of my "alternative algorithm" with a spatial hashmap for identifying if a large window to be placed would cover several blocks of free space.

I would consider some kind of spatial hashing structure. Imagine your entire free space is gridded coarsely, call them blocks. As windows come and go, they occupy certain sets of contiguous rectangular blocks. For each block, keep track of the largest unused rectangle incident to each corner, so you need to store 2*4 real numbers per block. For an empty block, the rectangles at each corner have size equal to the block. Thus, a block can only be "used up" at its corners, and so at most 4 windows can sit in any block.
Now each time you add a window, you have to search for a rectangular set of blocks for which the window will fit, and when you do, update the free corner sizes. You should size your blocks so that a handful (~4x4) of them fit into a typical window. For each window, keep track of which blocks it touches (you only need to keep track of extents), as well as which windows touch a given block (at most 4, in this algorithm). There is an obvious tradeoff between the granularity of the blocks and the amount of work per window insertion/removal.
When removing a window, loop over all blocks it touches, and for each block, recompute the free corner sizes (you know which windows touch it). This is fast since the inner loop is at most length 4.
I imagine a data structure like
struct block{
int free_x[4]; // 0 = top left, 1 = top right,
int free_y[4]; // 2 = bottom left, 3 = bottom right
int n_windows; // number of windows that occupy this block
int window_id[4]; // IDs of windows that occupy this block
};
block blocks[NX][NY];
struct window{
int id;
int used_block_x[2]; // 0 = first index of used block,
int used_block_y[2]; // 1 = last index of used block
};
Edit
Here is a picture:
It shows two example blocks. The colored dots indicate the corners of the block, and the arrows emanating from them indicate the extents of the largest-free-rectangle from that corner.
You mentioned in the edit that the grid on which your windows will be placed is already quite coarse (127x127), so the block sizes would probably be something like 4 grid cells on a side, which probably wouldn't gain you much. This method is suitable if your window corner coordinates can take on a lot of values (I was thinking they would be pixels), but not so much in your case. You can still try it, since it's simple. You would probably want to also keep a list of completely empty blocks so that if a window comes in that is larger than 2 block widths, then you look first in that list before looking for some suitable free space in the block grid.

After some false starts, I have eventually arrived here. Here is where the use of data structures for storing rectangular areas of free space have been abandoned. Instead, there is a 2d array with 128 x 128 elements to achieve the same result but with much less complexity.
The following function scans the array for an area width * height in size. The first position it finds it writes the top left coordinates of, to where resultx and resulty point to.
_Bool freespace_remove( freespace* fs,
int width, int height,
int* resultx, int* resulty)
{
int x = 0;
int y = 0;
const int rx = FSWIDTH - width;
const int by = FSHEIGHT - height;
*resultx = -1;
*resulty = -1;
char* buf[height];
for (y = 0; y < by; ++y)
{
x = 0;
char* scanx = fs->buf[y];
while (x < rx)
{
while(x < rx && *(scanx + x))
++x;
int w, h;
for (h = 0; h < height; ++h)
buf[h] = fs->buf[y + h] + x;
_Bool usable = true;
w = 0;
while (usable && w < width)
{
h = 0;
while (usable && h < height)
if (*(buf[h++] + w))
usable = false;
++w;
}
if (usable)
{
for (w = 0; w < width; ++w)
for (h = 0; h < height; ++h)
*(buf[h] + w) = 1;
*resultx = x;
*resulty = y;
return true;
}
x += w;
}
}
return false;
}
The 2d array is zero initialized. Any areas in the array where the space is used are set to 1. This structure and function will work independently from the actual list of windows that are occupying the areas marked with 1's.
The advantages of this method are its simplicity. It only uses one data structure - an array. The function is short, and should not be too difficult to adapt to handle the remaining placement options (here it only handles Row Smart + Left to Right + Top to Bottom).
My initial tests also look promising on the speed front. Though I don't think this would be suitable for a window manager placing windows on, for example, a 1600 x 1200 desktop with pixel accuracy, for my purposes I believe it is going to be much better than any of the previous methods I have tried.
Compilable test code here:
http://jwm-art.net/art/text/freespace_grid.c
(in Linux I use gcc -ggdb -O0 freespace_grid.c to compile)

#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#define FSWIDTH 128
#define FSHEIGHT 128
#ifdef USE_64BIT_ARRAY
#define FSBUFBITS 64
#define FSBUFWIDTH 2
typedef uint64_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctzl(( v ))
#define LEADING_ONES( v ) __builtin_clzl(~( v ))
#else
#ifdef USE_32BIT_ARRAY
#define FSBUFBITS 32
#define FSBUFWIDTH 4
typedef uint32_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz(( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ))
#else
#ifdef USE_16BIT_ARRAY
#define FSBUFBITS 16
#define FSBUFWIDTH 8
typedef uint16_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz( 0xffff0000 | ( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ) << 16)
#else
#ifdef USE_8BIT_ARRAY
#define FSBUFBITS 8
#define FSBUFWIDTH 16
typedef uint8_t fsbuf_type;
#define TRAILING_ZEROS( v ) __builtin_ctz( 0xffffff00 | ( v ))
#define LEADING_ONES( v ) __builtin_clz(~( v ) << 24)
#else
#define FSBUFBITS 1
#define FSBUFWIDTH 128
typedef unsigned char fsbuf_type;
#define TRAILING_ZEROS( v ) (( v ) ? 0 : 1)
#define LEADING_ONES( v ) (( v ) ? 1 : 0)
#endif
#endif
#endif
#endif
static const fsbuf_type fsbuf_max = ~(fsbuf_type)0;
static const fsbuf_type fsbuf_high = (fsbuf_type)1 << (FSBUFBITS - 1);
typedef struct freespacegrid
{
fsbuf_type buf[FSHEIGHT][FSBUFWIDTH];
_Bool left_to_right;
_Bool top_to_bottom;
} freespace;
void freespace_dump(freespace* fs)
{
int x, y;
for (y = 0; y < FSHEIGHT; ++y)
{
for (x = 0; x < FSBUFWIDTH; ++x)
{
fsbuf_type i = FSBUFBITS;
fsbuf_type b = fs->buf[y][x];
for(; i != 0; --i, b <<= 1)
putchar(b & fsbuf_high ? '#' : '/');
/*
if (x + 1 < FSBUFWIDTH)
putchar('|');
*/
}
putchar('\n');
}
}
freespace* freespace_new(void)
{
freespace* fs = malloc(sizeof(*fs));
if (!fs)
return 0;
int y;
for (y = 0; y < FSHEIGHT; ++y)
{
memset(&fs->buf[y][0], 0, sizeof(fsbuf_type) * FSBUFWIDTH);
}
fs->left_to_right = true;
fs->top_to_bottom = true;
return fs;
}
void freespace_delete(freespace* fs)
{
if (!fs)
return;
free(fs);
}
/* would be private function: */
void fs_set_buffer( fsbuf_type buf[FSHEIGHT][FSBUFWIDTH],
unsigned x,
unsigned y1,
unsigned xoffset,
unsigned width,
unsigned height)
{
fsbuf_type v;
unsigned y;
for (; width > 0 && x < FSBUFWIDTH; ++x)
{
if (width < xoffset)
v = (((fsbuf_type)1 << width) - 1) << (xoffset - width);
else if (xoffset < FSBUFBITS)
v = ((fsbuf_type)1 << xoffset) - 1;
else
v = fsbuf_max;
for (y = y1; y < y1 + height; ++y)
{
#ifdef FREESPACE_DEBUG
if (buf[y][x] & v)
printf("**** over-writing area ****\n");
#endif
buf[y][x] |= v;
}
if (width < xoffset)
return;
width -= xoffset;
xoffset = FSBUFBITS;
}
}
_Bool freespace_remove( freespace* fs,
unsigned width, unsigned height,
int* resultx, int* resulty)
{
unsigned x, x1, y;
unsigned w, h;
unsigned xoffset, x1offset;
unsigned tz; /* trailing zeros */
fsbuf_type* xptr;
fsbuf_type mask = 0;
fsbuf_type v;
_Bool scanning = false;
_Bool offset = false;
*resultx = -1;
*resulty = -1;
for (y = 0; y < (unsigned) FSHEIGHT - height; ++y)
{
scanning = false;
xptr = &fs->buf[y][0];
for (x = 0; x < FSBUFWIDTH; ++x, ++xptr)
{
if(*xptr == fsbuf_max)
{
scanning = false;
continue;
}
if (!scanning)
{
scanning = true;
x1 = x;
x1offset = xoffset = FSBUFBITS;
w = width;
}
retry:
if (w < xoffset)
mask = (((fsbuf_type)1 << w) - 1) << (xoffset - w);
else if (xoffset < FSBUFBITS)
mask = ((fsbuf_type)1 << xoffset) - 1;
else
mask = fsbuf_max;
offset = false;
for (h = 0; h < height; ++h)
{
v = fs->buf[y + h][x] & mask;
if (v)
{
tz = TRAILING_ZEROS(v);
offset = true;
break;
}
}
if (offset)
{
if (tz)
{
x1 = x;
w = width;
x1offset = xoffset = tz;
goto retry;
}
scanning = false;
}
else
{
if (w <= xoffset) /***** RESULT! *****/
{
fs_set_buffer(fs->buf, x1, y, x1offset, width, height);
*resultx = x1 * FSBUFBITS + (FSBUFBITS - x1offset);
*resulty = y;
return true;
}
w -= xoffset;
xoffset = FSBUFBITS;
}
}
}
return false;
}
int main(int argc, char** argv)
{
int x[1999];
int y[1999];
int w[1999];
int h[1999];
int i;
freespace* fs = freespace_new();
for (i = 0; i < 1999; ++i, ++u)
{
w[i] = rand() % 18 + 4;
h[i] = rand() % 18 + 4;
freespace_remove(fs, w[i], h[i], &x[i], &y[i]);
/*
freespace_dump(fs);
printf("w:%d h:%d x:%d y:%d\n", w[i], h[i], x[i], y[i]);
if (x[i] == -1)
printf("not removed space %d\n", i);
getchar();
*/
}
freespace_dump(fs);
freespace_delete(fs);
return 0;
}
The above code requires one of USE_64BIT_ARRAY, USE_32BIT_ARRAY, USE_16BIT_ARRAY, USE_8BIT_ARRAY to be defined otherwise it will fall back to using only the high bit of an unsigned char for storing the state of grid cells.
The function fs_set_buffer will not be declared in the header, and will become static within the implementation when this code gets split between .h and .c files. A more user friendly function hiding the implementation details will be provided for removing used space from the grid.
Overall, this implementation is faster without optimization than my previous answer with maximum optimization (using GCC on 64bit Gentoo, optimization options -O0 and -O3 respectively).
Regarding USE_NNBIT_ARRAY and the different bit sizes, I used two different methods of timing the code which make 1999 calls to freespace_remove.
Timing main() using the Unix time command (and disabling any output in the code) seemed to prove my expectations correct - that higher bit sizes are faster.
On the other hand, timing individual calls to freespace_remove (using gettimeofday) and comparing the maximum time taken over the 1999 calls seemed to indicate lower bit sizes were faster.
This has only been tested on a 64bit system (Intel Dual Core II).

Related

Fast random/mutation algorithms (vector to vector) [duplicate]

I've been trying to create a generalized Gradient Noise generator (which doesn't use the hash method to get gradients). The code is below:
class GradientNoise {
std::uint64_t m_seed;
std::uniform_int_distribution<std::uint8_t> distribution;
const std::array<glm::vec2, 4> vector_choice = {glm::vec2(1.0, 1.0), glm::vec2(-1.0, 1.0), glm::vec2(1.0, -1.0),
glm::vec2(-1.0, -1.0)};
public:
GradientNoise(uint64_t seed) {
m_seed = seed;
distribution = std::uniform_int_distribution<std::uint8_t>(0, 3);
}
// 0 -> 1
// just passes the value through, origionally was perlin noise activation
double nonLinearActivationFunction(double value) {
//return value * value * value * (value * (value * 6.0 - 15.0) + 10.0);
return value;
}
// 0 -> 1
//cosine interpolation
double interpolate(double a, double b, double t) {
double mu2 = (1 - cos(t * M_PI)) / 2;
return (a * (1 - mu2) + b * mu2);
}
double noise(double x, double y) {
std::mt19937_64 rng;
//first get the bottom left corner associated
// with these coordinates
int corner_x = std::floor(x);
int corner_y = std::floor(y);
// then get the respective distance from that corner
double dist_x = x - corner_x;
double dist_y = y - corner_y;
double corner_0_contrib; // bottom left
double corner_1_contrib; // top left
double corner_2_contrib; // top right
double corner_3_contrib; // bottom right
std::uint64_t s1 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y) + m_seed);
std::uint64_t s2 = ((std::uint64_t(corner_x) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s3 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y + 1) + m_seed);
std::uint64_t s4 = ((std::uint64_t(corner_x + 1) << 32) + std::uint64_t(corner_y) + m_seed);
// each xy pair turns into distance vector from respective corner, corner zero is our starting corner (bottom
// left)
rng.seed(s1);
corner_0_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y});
rng.seed(s2);
corner_1_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x, dist_y - 1});
rng.seed(s3);
corner_2_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y - 1});
rng.seed(s4);
corner_3_contrib = glm::dot(vector_choice[distribution(rng)], {dist_x - 1, dist_y});
double u = nonLinearActivationFunction(dist_x);
double v = nonLinearActivationFunction(dist_y);
double x_bottom = interpolate(corner_0_contrib, corner_3_contrib, u);
double x_top = interpolate(corner_1_contrib, corner_2_contrib, u);
double total_xy = interpolate(x_bottom, x_top, v);
return total_xy;
}
};
I then generate an OpenGL texture to display with like this:
int width = 1024;
int height = 1024;
unsigned char *temp_texture = new unsigned char[width*height * 4];
double octaves[5] = {2,4,8,16,32};
for( int i = 0; i < height; i++){
for(int j = 0; j < width; j++){
double d_noise = 0;
d_noise += temp_1.noise(j/octaves[0], i/octaves[0]);
d_noise += temp_1.noise(j/octaves[1], i/octaves[1]);
d_noise += temp_1.noise(j/octaves[2], i/octaves[2]);
d_noise += temp_1.noise(j/octaves[3], i/octaves[3]);
d_noise += temp_1.noise(j/octaves[4], i/octaves[4]);
d_noise/=5;
uint8_t noise = static_cast<uint8_t>(((d_noise * 128.0) + 128.0));
temp_texture[j*4 + (i * width * 4) + 0] = (noise);
temp_texture[j*4 + (i * width * 4) + 1] = (noise);
temp_texture[j*4 + (i * width * 4) + 2] = (noise);
temp_texture[j*4 + (i * width * 4) + 3] = (255);
}
}
Which give good results:
But gprof is telling me that the Mersenne twister is taking up 62.4% of my time and growing with larger textures. Nothing else individual takes any where near as much time. While the Mersenne twister is fast after initialization, the fact that I initialize it every time I use it seems to make it pretty slow.
This initialization is 100% required for this to make sure that the same x and y generates the same gradient at each integer point (so you need either a hash function or seed the RNG each time).
I attempted to change the PRNG to both the linear congruential generator and Xorshiftplus, and while both ran orders of magnitude faster, they gave odd results:
LCG (one time, then running 5 times before using)
Xorshiftplus
After one iteration
After 10,000 iterations.
I've tried:
Running the generator several times before utilizing output, this results in slow execution or simply different artifacts.
Using the output of two consecutive runs after initial seed to seed the PRNG again and use the value after wards. No difference in result.
What is happening? What can i do to get faster results that are of the same quality as the mersenne twister?
OK BIG UPDATE:
I don't know why this works, I know it has something to do with the prime number utilized, but after messing around a bit, it appears that the following works:
Step 1, incorporate the x and y values as seeds separately (and incorporate some other offset value or additional seed value with them, this number should be a prime/non trivial factor)
Step 2, Use those two seed results into seeding the generator again back into the function (so like geza said, the seeds made were bad)
Step 3, when getting the result, instead of using modulo number of items (4) trying to get, or & 3, modulo the result by a prime number first then apply & 3. I'm not sure if the prime being a mersenne prime matters or not.
Here is the result with prime = 257 and xorshiftplus being used! (note I used 2048 by 2048 for this one, the others were 256 by 256)
LCG is known to be inadequate for your purpose.
Xorshift128+'s results are bad, because it needs good seeding. And providing good seeding defeats the whole purpose of using it. I don't recommend this.
However, I recommend using an integer hash. For example, one from Bob's page.
Here's a result of the first hash of that page, it looks OK to me, and it is fast (I think it is much faster than Mersenne Twister):
Here's the code I've written to generate this:
#include <cmath>
#include <stdio.h>
unsigned int hash(unsigned int a) {
a = (a ^ 61) ^ (a >> 16);
a = a + (a << 3);
a = a ^ (a >> 4);
a = a * 0x27d4eb2d;
a = a ^ (a >> 15);
return a;
}
unsigned int ivalue(int x, int y) {
return hash(y<<16|x)&0xff;
}
float smooth(float x) {
return 6*x*x*x*x*x - 15*x*x*x*x + 10*x*x*x;
}
float value(float x, float y) {
int ix = floor(x);
int iy = floor(y);
float fx = smooth(x-ix);
float fy = smooth(y-iy);
int v00 = ivalue(iy+0, ix+0);
int v01 = ivalue(iy+0, ix+1);
int v10 = ivalue(iy+1, ix+0);
int v11 = ivalue(iy+1, ix+1);
float v0 = v00*(1-fx) + v01*fx;
float v1 = v10*(1-fx) + v11*fx;
return v0*(1-fy) + v1*fy;
}
unsigned char pic[1024*1024];
int main() {
for (int y=0; y<1024; y++) {
for (int x=0; x<1024; x++) {
float v = 0;
for (int o=0; o<=9; o++) {
v += value(x/64.0f*(1<<o), y/64.0f*(1<<o))/(1<<o);
}
int r = rint(v*0.5f);
pic[y*1024+x] = r;
}
}
FILE *f = fopen("x.pnm", "wb");
fprintf(f, "P5\n1024 1024\n255\n");
fwrite(pic, 1, 1024*1024, f);
fclose(f);
}
If you want to understand, how a hash function work (or better yet, which properties a good hash have), check out Bob's page, for example this.
You (unknowingly?) implemented a visualization of PRNG non-random patterns. That looks very cool!
Except Mersenne Twister, all your tested PRNGs do not seem fit for your purpose. As I have not done further tests myself, I can only suggest to try out and measure further PRNGs.
The randomness of LCGs are known to be sensitive to the choice of their parameters. In particular, the period of a LCG is relative to the m parameter - at most it will be m (your prime factor) & for many values it can be less.
Similarly, the careful parameters selection is required to get a long period from Xorshift PRNGs.
You've noted that some PRNGs give good procedural generation results while other do not. In order to isolate the cause, I would factor out the proc gen stuff & examine the PRNG output directly. An easy way to visualize the data is to build a grey scale image where each pixel value is a (possibly scaled) random value. For image based stuff, I find this to be an easy way to find stuff that may lead to visual artifacts. Any artifacts you see with this are likely to cause issues with your proc gen output.
Another option is to try something like the Diehard tests. If the aforementioned image test failed to reveal any problems, I might use this just to be sure my PRNG techniques were trustworthy.
Note that your code seeds the PRNG, then generates one pseudorandom number from the PRNG. The reason for the nonrandomness in xorshift128+ that you discovered is that xorshift128+ simply adds the two halves of the seed (and uses the result mod 264 as the generated number) before changing its state (review its source code). This makes that PRNG considerably different from a hash function.
What you see is the practical demonstration of quality of PRNG. Mersenne Twister is one of the best PRNGs with good performance, it passes DIEHARD tests. One should know that generating a random numbers is not an easy computational task, so looking for a better performance will inevitably result in poor quality. LCG is known to be simplest and worst PRNG ever designed and it clearly shows two-dimensional correlation as in your picture. The quality of Xorshift generators largely depend on bitness and parameters. They are definitely worse than Mersenne Twister, but some (xorshift128+) may work good enough to pass BigCrush battery of TestU01 tests.
In other words, if you are making an important physical modelling numerical experiment, you better continue to use Mersenne Twister as known to be a good trade-off between speed and quality and it comes in many standard libraries. On a less important case you may try to use xorshift128+ generator. For an ultimate results you need to use cryptographical-quality PRNG (none of mentioned here may be used for cryptographical purposes).

How do I allocate memory and copy 2D arrays between CPU / GPU in CUDA without flattening them?

So I want to allocate 2D arrays and also copy them between the CPU and GPU in CUDA, but I am a total beginner and other online materials are very difficult for me to understand or are incomplete. It is important that I am able to access them as a 2D array in the kernel code as shown below.
Note that height != width for the arrays, that's something that further confuses me if it's possible as I always struggle choosing grid size.
I've considered flattening them, but I really want to get it working this way.
This is how far I've got by my own research.
__global__ void myKernel(int *firstArray, int *secondArray, int rows, int columns) {
int row = blockIdx.x * blockDim.x + threadIdx.x;
int column = blockIdx.y * blockDim.y + threadIdx.y;
if (row >= rows || column >= columns)
return;
// Do something with the arrays like you would on a CPU, like:
firstArray[row][column] = row * 2;
secondArray[row[column] = row * 3;
}
int main() {
int rows = 300, columns = 200;
int h_firstArray[rows][columns], h_secondArray[rows][columns];
int *d_firstArray[rows][columns], *d_secondArray[rows][columns];
// populate h_ arrays (Can do this bit myself)
// Allocate memory on device, no idea how to do for 2D arrays.
// Do memcopies to GPU, no idea how to do for 2D arrays.
dim3 block(rows,columns);
dim3 grid (1,1);
myKernel<<<grid,block>>>(d_firstArray, d_secondArray, rows, columns);
// Do memcopies back to host, no idea how to do for 2D arrays.
cudaFree(d_firstArray);
cudaFree(d_secondArray);
return 0;
}
EDIT: I was asked if the array width will be known at compile time in the problems I would try to solve. You can assume it is as I'm interested primarily in this particular situation as of now.
In the general case (array dimensions not known until runtime), handling doubly-subscripted access in CUDA device code requires an array of pointers, just as it does in host code. C and C++ handle each subscript as a pointer dereference, in order to reach the final location in the "2D array".
Double-pointer/doubly-subscripted access in device code in the general case is already covered in the canonical answer linked from the cuda tag info page. There are several drawbacks to this, which are covered in that answer so I won't repeat them here.
However, if the array width is known at compile time (array height can be dynamic - i.e. determined at runtime), then we can leverage the compiler and the language typing mechanisms to allow us to circumvent most of the drawbacks. Your code demonstrates several other incorrect patterns for CUDA and/or C/C++ usage:
Passing an item for doubly-subscripted access to a C or C++ function cannot be done with a simple single pointer type like int *firstarray
Allocating large host arrays via stack-based mechanisms:
int h_firstArray[rows][columns], h_secondArray[rows][columns];
is often problematic in C and C++. These are stack based variables and will often run into stack limits if large enough.
CUDA threadblocks are limited to 1024 threads total. Therefore such a threadblock dimension:
dim3 block(rows,columns);
will not work except for very small sizes of rows and columns (the product must be less than or equal to 1024).
When declaring pointer variables for a device array in CUDA, it is almost never correct to create arrays of pointers:
int *d_firstArray[rows][columns], *d_secondArray[rows][columns];
nor do we allocate space on the host, then "reallocate" those pointers for device usage.
What follows is a worked example with the above items addressed and demonstrating the aforementioned method where the array width is known at runtime:
$ cat t50.cu
#include <stdio.h>
const int array_width = 200;
typedef int my_arr[array_width];
__global__ void myKernel(my_arr *firstArray, my_arr *secondArray, int rows, int columns) {
int column = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row >= rows || column >= columns)
return;
// Do something with the arrays like you would on a CPU, like:
firstArray[row][column] = row * 2;
secondArray[row][column] = row * 3;
}
int main() {
int rows = 300, columns = array_width;
my_arr *h_firstArray, *h_secondArray;
my_arr *d_firstArray, *d_secondArray;
size_t dsize = rows*columns*sizeof(int);
h_firstArray = (my_arr *)malloc(dsize);
h_secondArray = (my_arr *)malloc(dsize);
// populate h_ arrays
memset(h_firstArray, 0, dsize);
memset(h_secondArray, 0, dsize);
// Allocate memory on device
cudaMalloc(&d_firstArray, dsize);
cudaMalloc(&d_secondArray, dsize);
// Do memcopies to GPU
cudaMemcpy(d_firstArray, h_firstArray, dsize, cudaMemcpyHostToDevice);
cudaMemcpy(d_secondArray, h_secondArray, dsize, cudaMemcpyHostToDevice);
dim3 block(32,32);
dim3 grid ((columns+block.x-1)/block.x,(rows+block.y-1)/block.y);
myKernel<<<grid,block>>>(d_firstArray, d_secondArray, rows, columns);
// Do memcopies back to host
cudaMemcpy(h_firstArray, d_firstArray, dsize, cudaMemcpyDeviceToHost);
cudaMemcpy(h_secondArray, d_secondArray, dsize, cudaMemcpyDeviceToHost);
// validate
if (cudaGetLastError() != cudaSuccess) {printf("cuda error\n"); return -1;}
for (int i = 0; i < rows; i++)
for (int j = 0; j < columns; j++){
if (h_firstArray[i][j] != i*2) {printf("first mismatch at %d,%d, was: %d, should be: %d\n", i,j,h_firstArray[i][j], i*2); return -1;}
if (h_secondArray[i][j] != i*3) {printf("second mismatch at %d,%d, was: %d, should be: %d\n", i,j,h_secondArray[i][j], i*3); return -1;}}
printf("success!\n");
cudaFree(d_firstArray);
cudaFree(d_secondArray);
return 0;
}
$ nvcc -arch=sm_61 -o t50 t50.cu
$ cuda-memcheck ./t50
========= CUDA-MEMCHECK
success!
========= ERROR SUMMARY: 0 errors
$
I've reversed the sense of your kernel indexing (x,y) to help with coalesced global memory access. We see that with this kind of type creation, we can leverage the compiler and the language features to end up with a code that allows for doubly-subscripted access in both host and device code, while otherwise allowing CUDA operations (e.g. cudaMemcpy) as if we are dealing with single-pointer (e.g. "flattened") arrays.

Covering segments by points

I did search and looked at these below links but it didn't help .
Point covering problem
Segments poked (covered) with points - any tricky test cases?
Need effective greedy for covering a line segment
Problem Description:
You are given a set of segments on a line and your goal is to mark as
few points on a line as possible so that each segment contains at least
one marked point
Task.
Given a set of n segments {[a0,b0],[a1,b1]....[an-1,bn-1]} with integer
coordinates on a line, find the minimum number 'm' of points such that
each segment contains at least one point .That is, find a set of
integers X of the minimum size such that for any segment [ai,bi] there
is a point x belongs X such that ai <= x <= bi
Output Description:
Output the minimum number m of points on the first line and the integer
coordinates of m points (separated by spaces) on the second line
Sample Input - I
3
1 3
2 5
3 6
Output - I
1
3
Sample Input - II
4
4 7
1 3
2 5
5 6
Output - II
2
3 6
I didn't understand the question itself. I need the explanation, on how to solve this above problem, but i don't want the code. Examples would be greatly helpful
Maybe this formulation of the problem will be easier to understand. You have n people who can each tolerate a different range of temperatures [ai, bi]. You want to find the minimum number of rooms to make them all happy, i.e. you can set each room to a certain temperature so that each person can find a room within his/her temperature range.
As for how to solve the problem, you said you didn't want code, so I'll just roughly describe an approach. Think about the coldest room you have. If making it one degree warmer won't cause anyone to no longer be able to tolerate that room, you might as well make the increase, since that can only allow more people to use that room. So the first temperature you should set is the warmest one that the most cold-loving person can still tolerate. In other words, it should be the smallest of the bi. Now this room will satisfy some subset of your people, so you can remove them from consideration. Then repeat the process on the remaining people.
Now, to implement this efficiently, you might not want to literally do what I said above. I suggest sorting the people according to bi first, and for the ith person, try to use an existing room to satisfy them. If you can't, try to create a new one with the highest temperature possible to satisfy them, which is bi.
Yes the description is pretty vague and the only meaning that makes sense to me is this:
You got some line
Segment on a line is defined by l,r
Where one parameter is distance from start of line and second is the segments length. Which one is which is hard to tell as the letters are not very usual for such description. My bet is:
l length of segment
r distance of (start?) of segment from start of line
You want to find min set of points
So that each segment has at least one point in it. That mean for 2 overlapped segments you need just one point ...
Surely there are more option how to solve this, the obvious is genere & test with some heuristics like genere combinations only for segments that are overlapped more then once. So I would attack this task in this manner (using assumed terminology from #2):
sort segments by r
add number of overlaps to your segment set data
so the segment will be { r,l,n } and set the n=0 for all segments for now.
scan segments for overlaps
something like
for (i=0;i<segments;i++) // loop all segments
for (j=i+1;j<segments;j++) // loop all latter segments until they are still overlapped
if ( segment[i] and segment [j] are overlapped )
{
segment[i].n++; // update overlap counters
segment[j].n++;
}
else break;
Now if the r-sorted segments are overlapped then
segment[i].r <=segment[j].r
segment[i].r+segment[i].l>=segment[j].r
scan segments handling non overlapped segments
for each segment such that segment[i].n==0 add to the solution point list its point (middle) defined by distance from start of line.
points.add(segment[i].r+0.5*segment[i].l);
And after that remove segment from the list (or tag it as used or what ever you do for speed boost...).
scan segments that are overlapped just once
So if segment[i].n==1 then you need to determine if it is overlapped with i-1 or i+1. So add the mid point of the overlap to the solution points and remove i segment from list. Then decrement the n of the overlapped segment (i+1 or i-1)` and if zero remove it too.
points.add(0.5*( segment[j].r + min(segment[i].r+segment[i].l , segment[j].r+segment[j].l )));
Loop this whole scanning until there is no new point added to the solution.
now you got only multiple overlaps left
From this point I will be a bit vague for 2 reasons:
I do not have this tested and I d not have any test data to validate not to mention I am lazy.
This smells like assignment so there is some work/fun left for you.
From start I would scann all segments and remove all of them which got any point from the solution inside. This step you should perform after any changes in the solution.
Now you can experiment with generating combination of points for each overlapped group of segments and remember the minimal number of points covering all segments in group. (simply by brute force).
There are more heuristics possible like handling all twice overlapped segments (in similar manner as the single overlaps) but in the end you will have to do brute force on the rest of data ...
[edit1] as you added new info
The r,l means distance of left and right from the start of line. So if you want to convert between the other formulation { r',l' } and (l<=r) then
l=r`
r=r`+l`
and back
r`=l
l`=r-l`
Sorry too lazy to rewrite the whole thing ...
Here is the working solution in C, please refer to it partially and try to fix your code before reading the whole. Happy coding :) Spoiler alert
#include <stdio.h>
#include <stdlib.h>
int cmp_func(const void *ptr_a, const void *ptr_b)
{
const long *a = *(double **)ptr_a;
const long *b = *(double **)ptr_b;
if (a[1] == b[1])
return a[0] - b[0];
return a[1] - b[1];
}
int main()
{
int i, j, n, num_val;
long **arr;
scanf("%d", &n);
long values[n];
arr = malloc(n * sizeof(long *));
for (i = 0; i < n; ++i) {
*(arr + i) = malloc(2 * sizeof(long));
scanf("%ld %ld", &arr[i][0], &arr[i][1]);
}
qsort(arr, n, sizeof(long *), cmp_func);
i = j = 0;
num_val = 0;
while (i < n) {
int skip = 0;
values[num_val] = arr[i][1];
for (j = i + 1; j < n; ++j) {
int condition;
condition = arr[i][1] <= arr[j][1] ? arr[j][0] <= arr[i][1] : 0;
if (condition) {
skip++;
} else {
break;
}
}
num_val++;
i += skip + 1;
}
printf("%d\n", num_val);
for (int k = 0; k < num_val; ++k) {
printf("%ld ", values[k]);
}
free(arr);
return 0;
}
Here's the working code in C++ for anyone searching :)
#include <bits/stdc++.h>
#define ll long long
#define double long double
#define vi vector<int>
#define endl "\n"
#define ff first
#define ss second
#define pb push_back
#define all(x) (x).begin(),(x).end()
#define mp make_pair
using namespace std;
bool cmp(const pair<ll,ll> &a, const pair<ll,ll> &b)
{
return (a.second < b.second);
}
vector<ll> MinSig(vector<pair<ll,ll>>&vec)
{
vector<ll> points;
for(int x=0;x<vec.size()-1;)
{
bool found=false;
points.pb(vec[x].ss);
for(int y=x+1;y<vec.size();y++)
{
if(vec[y].ff>vec[x].ss)
{
x=y;
found=true;
break;
}
}
if(!found)
break;
}
return points;
}
int main()
{
ios_base::sync_with_stdio(false);
cin.tie(NULL);
int n;
cin>>n;
vector<pair<ll,ll>>v;
for(int x=0;x<n;x++)
{
ll temp1,temp2;
cin>>temp1>>temp2;
v.pb(mp(temp1,temp2));
}
sort(v.begin(),v.end(),cmp);
vector<ll>res=MinSig(v);
cout<<res.size()<<endl;
for(auto it:res)
cout<<it<<" ";
}

Parallel multiplication of many small matrices by fixed vector

Situation is the following: I have a number (1000s) of elements which are given by small matrices of dimensions 4x2, 9x3 ... you get the idea. All matrices have the same dimension.
I want to multiply each of these matrices with a fixed vector of precalculated values. In short:
for(i = 1...n)
X[i] = M[i] . N;
What is the best approach to do this in parallel using Thrust? How do I lay out my data in memory?
NB: There might be specialized, more suitable libraries to do this on GPUs. I'm interested in Thrust because it allows me to deploy to different backends, not just CUDA.
One possible approach:
flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general thrust processing anyway.
use a strided range mechanism to take your scaling vector and extend it to the overall length of your flattened data vector
use thrust::transform with thrust::multiplies to multiply the two vectors together.
If you need to access the matrices later out of your flattened data vector (or result vector), you can do so with pointer arithmetic, or a combination of fancy iterators.
If you need to re-use the extended scaling vector, you may want to use the method outlined in step 2 exactly (i.e. create an actual vector using that method, length = N matrices, repeated). If you are only doing this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (modulo the length of your matrix in elements), followed by a permutation iterator, to index into your original scaling vector (length = 1 matrix).
The following example implements the above, without using the strided range iterator method:
#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>
#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024
struct my_modulo_functor : public thrust::unary_function<int, int>
{
__host__ __device__
int operator() (int idx) {
return idx%(H_MAT*W_MAT);}
};
int main(){
thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
thrust::host_vector<int> scale(H_MAT*W_MAT);
// synthetic; instead flatten/copy matrices into data vector
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;
thrust::device_vector<int> d_data = data;
thrust::device_vector<int> d_scale = scale;
thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);
thrust::transform(d_data.begin(), d_data.end(), thrust::make_permutation_iterator(d_scale.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), my_modulo_functor())) ,d_result.begin(), thrust::multiplies<int>());
thrust::host_vector<int> result = d_result;
for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++)
if (result[i] != data[i] * scale[i%(H_MAT*W_MAT)]) {std::cout << "Mismatch at: " << i << " cpu result: " << (data[i] * scale[i%(H_MAT*W_MAT)]) << " gpu result: " << result[i] << std::endl; return 1;}
std::cout << "Success!" << std::endl;
return 0;
}
EDIT: Responding to a question below:
The benefit of fancy iterators (i.e. transform(numbers, iterator)) is that they often allow for eliminaion of extra data copies/data movement, as compared to assembling other number (which requires extra steps and data movement) and then passing it to transform(numbers, other numbers). If you're only going to use other numbers once, then the fancy iterators will generally be better. If you're going to use other numbers again, then you may want to assemble it explicitly. This preso is instructive, in particular "Fusion".
For a one-time use of other numbers the overhead of assembling it on the fly using fancy iterators and the functor is generally lower than explicitly creating a new vector, and then passing that new vector to the transform routine.
When looking for a software library which is concisely made for multiplying small matrices, then one may have a look at https://github.com/hfp/libxsmm. Below, the code requests a specialized matrix kernel according to the typical GEMM parameters (please note that some limitations apply).
double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;
xmm = libxsmm_dmmdispatch(m, n, k,
&m/*lda*/, &k/*ldb*/, &m/*ldc*/,
&alpha, &beta, &flags, &prefetch);
Given the above code, one can proceed and run "xmm" for an entire series of (small) matrices without a particular data structure (below code also uses "prefetch locations").
if (0 < n) { /* check that n is at least 1 */
# pragma parallel omp private(i)
for (i = 0; i < (n - 1); ++i) {
const double *const ai = a + i * asize;
const double *const bi = b + i * bsize;
double *const ci = c + i * csize;
xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
}
xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
/* pseudo prefetch for last element of batch (avoids page fault) */
a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}
In addition to the manual loop control as shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures exist that describe the series of operands (A, B, and C matrices).
There are two reasons why this library gives superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while calculating the current product.
( Given one is looking for something that blends well with C/C++, this library supports it. However, it does not aim for CUDA/Thrust. )

Benchmark problems for testing concurency

For one of the projects I'm doing right now, I need to look at the performance (amongst other things) of different concurrent enabled programming languages.
At the moment I'm looking into comparing stackless python and C++ PThreads, so the focus is on these two languages, but other languages will probably be tested later. Ofcourse the comparison must be as representative and accurate as possible, so my first thought was to start looking for some standard concurrent/multi-threaded benchmark problems, alas I couldn't find any decent or standard, tests/problems/benchmarks.
So my question is as follows: Do you have a suggestion for a good, easy or quick problem to test the performance of the programming language (and to expose it's strong and weak points in the process)?
Surely you should be testing hardware and compilers rather than a language for concurrency performance?
I would be looking at a language from the point of view of how easy and productive it is in terms of concurrency and how much it 'insulates' the programmer from making locking mistakes.
EDIT: from past experience as a researcher designing parallel algorithms, I think you will find in most cases the concurrent performance will depend largely on how an algorithm is parallelised, and how it targets the underlying hardware.
Also, benchmarks are notoriously unequal; this is even more so in a parallel environment. For instance, a benchmark that 'crunches' very large matrices would be suited to a vector pipeline processor, whereas a parallel sort might be better suited to more general purpose multi core CPUs.
These might be useful:
Parallel Benchmarks
NAS Parallel Benchmarks
Well, there are a few classics, but different tests emphasize different features. Some distributed systems may be more robust, have more efficient message-passing, etc. Higher message overhead can hurt scalability, since it the normal way to scale up to more machines is to send a larger number of small messages. Some classic problems you can try are a distributed Sieve of Eratosthenes or a poorly implemented fibonacci sequence calculator (i.e. to calculate the 8th number in the series, spin of a machine for the 7th, and another for the 6th). Pretty much any divide-and-conquer algorithm can be done concurrently. You could also do a concurrent implementation of Conway's game of life or heat transfer. Note that all of these algorithms have different focuses and thus you probably will not get one distributed system doing the best in all of them.
I'd say the easiest one to implement quickly is the poorly implemented fibonacci calculator, though it places too much emphasis on creating threads and too little on communication between those threads.
Surely you should be testing hardware
and compilers rather than a language
for concurrency performance?
No, hardware and compilers are irrelevant for my testing purposes. I'm just looking for some good problems that can test how well code, written in one language, can compete against code from another language. I'm really testing the constructs available in the specific languages to do concurrent programming. And one of the criteria is performance (measured in time).
Some of the other test criteria I'm looking for are:
how easy is it to write correct code; because as we all know concurrent programming is harder then writing single threaded programs
what is the technique used to to concurrent programming: event-driven, actor based, message parsing, ...
how much code must be written by the programmer himself and how much is done automatically for him: this can also be tested with the given benchmark problems
what's the level of abstraction and how much overhead is involved when translated back to machine code
So actually, I'm not looking for performance as the only and best parameter (which would indeed send me to the hardware and the compilers instead of the language itself), I'm actually looking from a programmers point of view to check what language is best suited for what kind of problems, what it's weaknesses and strengths are and so on...
Bare in mind that this is just a small project and the tests are therefore to be kept small as well. (rigorous testing of everything is therefore not feasible)
I have decided to use the Mandelbrot set (the escape time algorithm to be more precise) to benchmark the different languages.
It fits me quite well as the original algorithm can easily be implemented and creating the multi threaded variant from it is not that much work.
below is the code I currently have. It is still a single threaded variant, but I'll update it as soon as I'm satisfied with the result.
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
int DoThread(const double x, const double y, int maxiter) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++) {
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(int horizPixels, int vertPixels, int maxiter, std::vector<std::vector<int> >& result) {
for(int x = horizPixels; x > 0; x--) {
for(int y = vertPixels; y > 0; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x-1][y-1] = DoThread((3.0 / horizPixels) * (x - (horizPixels / 2)),(3.0 / vertPixels) * (y - (vertPixels / 2)),maxiter);
}
}
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
int horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
int vertPixels = atoi(argv[2]);
//third arg = iterations
int maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
std::vector<std::vector<int> > result(horizPixels, std::vector<int>(vertPixels,0)); //create and init 2-dimensional vector
SingleThreaded(horizPixels, vertPixels, maxiter, result);
//TODO: remove these lines
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
}
I've tested it with gcc under Linux, but I'm sure it works under other compilers/Operating Systems as well. To get it to work you have to enter some command line arguments like so:
mandelbrot 106 500 255 1
the first argument is the width (x-axis)
the second argument is the height (y-axis)
the third argument is the number of maximum iterations (the number of colors)
the last ons is the number of threads (but that one is currently not used)
on my resolution, the above example gives me a nice ASCII-art representation of a Mandelbrot set. But try it for yourself with different arguments (the first one will be the most important one, as that will be the width)
Below you can find the code I hacked together to test the multi threaded performance of pthreads. I haven't cleaned it up and no optimizations have been made; so the code is a bit raw.
the code to save the calculated mandelbrot set as a bitmap is not mine, you can find it here
#include <cstdlib> //for atoi
#include <iostream>
#include <iomanip> //for setw and setfill
#include <vector>
#include "bitmap_Image.h" //for saving the mandelbrot as a bmp
#include <pthread.h>
pthread_mutex_t mutexCounter;
int sharedCounter(0);
int percent(0);
int horizPixels(0);
int vertPixels(0);
int maxiter(0);
//doesn't need to be locked
std::vector<std::vector<int> > result; //create 2 dimensional vector
void *DoThread(void *null) {
double curX,curY,xSquare,ySquare,x,y;
int i, intx, inty, counter;
counter = 0;
do {
counter++;
pthread_mutex_lock (&mutexCounter); //lock
intx = int((sharedCounter / vertPixels) + 0.5);
inty = sharedCounter % vertPixels;
sharedCounter++;
pthread_mutex_unlock (&mutexCounter); //unlock
//exit thread when finished
if (intx >= horizPixels) {
std::cout << "exited thread - I did " << counter << " calculations" << std::endl;
pthread_exit((void*) 0);
}
//set x and y to the correct value now -> in the range like singlethread
x = (3.0 / horizPixels) * (intx - (horizPixels / 1.5));
y = (3.0 / vertPixels) * (inty - (vertPixels / 2));
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
result[intx][inty] = i;
} while (true);
}
int DoSingleThread(const double x, const double y) {
double curX,curY,xSquare,ySquare;
int i;
curX = x + x*x - y*y;
curY = y + x*y + x*y;
ySquare = curY*curY;
xSquare = curX*curX;
for (i=0; i<maxiter && ySquare + xSquare < 4;i++){
ySquare = curY*curY;
xSquare = curX*curX;
curY = y + curX*curY + curX*curY;
curX = x - ySquare + xSquare;
}
return i;
}
void SingleThreaded(std::vector<std::vector<int> >& result) {
for(int x = horizPixels - 1; x != -1; x--) {
for(int y = vertPixels - 1; y != -1; y--) {
//3.0 -> so we always have -1.5 -> 1.5 as the window; (x - (horizPixels / 2) will go from -horizPixels/2 to +horizPixels/2
result[x][y] = DoSingleThread((3.0 / horizPixels) * (x - (horizPixels / 1.5)),(3.0 / vertPixels) * (y - (vertPixels / 2)));
}
}
}
void MultiThreaded(int threadCount, std::vector<std::vector<int> >& result) {
/* Initialize and set thread detached attribute */
pthread_t thread[threadCount];
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (int i = 0; i < threadCount - 1; i++) {
pthread_create(&thread[i], &attr, DoThread, NULL);
}
std::cout << "all threads created" << std::endl;
for(int i = 0; i < threadCount - 1; i++) {
pthread_join(thread[i], NULL);
}
std::cout << "all threads joined" << std::endl;
}
int main(int argc, char* argv[]) {
//first arg = length along horizontal axis
horizPixels = atoi(argv[1]);
//second arg = length along vertical axis
vertPixels = atoi(argv[2]);
//third arg = iterations
maxiter = atoi(argv[3]);
//fourth arg = threads
int threadCount = atoi(argv[4]);
result = std::vector<std::vector<int> >(horizPixels, std::vector<int>(vertPixels,21)); // init 2-dimensional vector
if (threadCount <= 1) {
SingleThreaded(result);
} else {
MultiThreaded(threadCount, result);
}
//TODO: remove these lines
bitmapImage image(horizPixels, vertPixels);
for(int y = 0; y < vertPixels; y++) {
for(int x = 0; x < horizPixels; x++) {
image.setPixelRGB(x,y,16777216*result[x][y]/maxiter % 256, 65536*result[x][y]/maxiter % 256, 256*result[x][y]/maxiter % 256);
//std::cout << std::setw(2) << std::setfill('0') << std::hex << result[x][y] << " ";
}
std::cout << std::endl;
}
image.saveToBitmapFile("~/Desktop/test.bmp",32);
}
good results can be obtained using the program with the following arguments:
mandelbrot 5120 3840 256 3
that way you will get an image that is 5 * 1024 wide; 5 * 768 high with 256 colors (alas you will only get 1 or 2) and 3 threads (1 main thread that doesn't do any work except creating the worker threads, and 2 worker threads)
Since the benchmarks game moved to a quad-core machine September 2008, many programs in different programming languages have been re-written to exploit quad-core - for example, the first 10 mandelbrot programs.

Resources