OpenCL crash when calling finish() - macos

I am writing an OpenCL app on mac using c++, and it crashes in certain cases depending on the work size.
The program crashes due to a SIGABRT.
Is there any way to get more information about the error?
Why is SIGABRT being raised? Can I catch it?
EDIT:
I realize that this program is a doozie, however I will try to explain it in case anyone would like to take a stab at it.
Through debugging I discovered that the cause of the SIGABRT was one of the kernels timing out.
The program is a tile-based 3D renderer. It is an OpenCL implementation of this algorithm: https://github.com/ssloy/tinyrenderer
The screen is divided into 8x8 tiles. One of the kernels (the tiler) computes which polygons overlap each tile, storing the results in a data structure called tilePolys. A subsequent kernel (the rasterizer), which runs one work item per tile, iterates over the list of polys occupying the tile and rasterizes them.
The tiler writes to an integer buffer which is a list of lists of polygon indices. Each list is of a fixed size (polysPerTile + 1 for the count) where the first element is the count and the subsequent polysPerTile elements are indices of polygons in the tile. There is one such list per tile.
For some reason in certain cases the tiler writes a very large poly count (13172746) to one of the tile's lists in tilePolys. This causes the rasterizer to loop for a long time and time out.
The strange thing is that the index to which the large count is written is never accessed by the tiler.
The code for the tiler kernel is below:
// this kernel is executed once per polygon
// it computes which tiles are occupied by the polygon and adds the index of the polygon to the list for that tile
kernel void tiler(
// number of polygons
ulong nTris,
// width of screen
int width,
// height of screen
int height,
// number of tiles in x direction
int tilesX,
// number of tiles in y direction
int tilesY,
// number of pixels per tile (tiles are square)
int tileSize,
// size of the polygon list for each tile
int polysPerTile,
// 4x4 matrix representing the viewport
global const float4* viewport,
// vertex positions
global const float* vertices,
// indices of vertices
global const int* indices,
// array of array-lists of polygons per tile
// structure of list is an int representing the number of polygons covering that tile,
// followed by [polysPerTile] integers representing the indices of the polygons in that tile
// there are [tilesX*tilesY] such arraylists
volatile global int* tilePolys)
{
size_t faceInd = get_global_id(0);
// compute vertex position in viewport space
float3 vs[3];
for(int i = 0; i < 3; i++) {
// indices are vertex/uv/normal
int vertInd = indices[faceInd*9+i*3];
float4 vertHomo = (float4)(vertices[vertInd*4], vertices[vertInd*4+1], vertices[vertInd*4+2], vertices[vertInd*4+3]);
vertHomo = vec4_mul_mat4(vertHomo, viewport);
vs[i] = vertHomo.xyz / vertHomo.w;
}
float2 bboxmin = (float2)(INFINITY,INFINITY);
float2 bboxmax = (float2)(-INFINITY,-INFINITY);
// size of screen
float2 clampCoords = (float2)(width-1, height-1);
// compute bounding box of triangle in screen space
for (int i=0; i<3; i++) {
for (int j=0; j<2; j++) {
bboxmin[j] = max(0.f, min(bboxmin[j], vs[i][j]));
bboxmax[j] = min(clampCoords[j], max(bboxmax[j], vs[i][j]));
}
}
// transform bounding box to tile space
int2 tilebboxmin = (int2)(bboxmin[0] / tileSize, bboxmin[1] / tileSize);
int2 tilebboxmax = (int2)(bboxmax[0] / tileSize, bboxmax[1] / tileSize);
// loop over all tiles in bounding box
for(int x = tilebboxmin[0]; x <= tilebboxmax[0]; x++) {
for(int y = tilebboxmin[1]; y <= tilebboxmax[1]; y++) {
// get index of tile
int tileInd = y * tilesX + x;
// get start index of polygon list for this tile
int counterInd = tileInd * (polysPerTile + 1);
// get current number of polygons in list
int numPolys = atomic_inc(&tilePolys[counterInd]);
// if list is full, skip tile
if(numPolys >= polysPerTile) {
// decrement the count because we will not add to the list
atomic_dec(&tilePolys[counterInd]);
} else {
// otherwise add the poly to the list
// the index is the offset + numPolys + 1 as tilePolys[counterInd] holds the poly count
int ind = counterInd + numPolys + 1;
tilePolys[ind] = (int)(faceInd);
}
}
}
}
My theories are that either:
I have incorrectly implemented the atomic functions for reading and incrementing the count
I am using an incorrect number format causing garbage to be written into tilePolys
One of my other kernels is inadvertently writing into the tilePolys buffer
I do not think it is the last one though because if instead of writing faceInd to tilePolys, I write a constant value, the large poly count disappears.
tilePolys[counterInd+numPolys+1] = (int)(faceInd); // this is the problem line
tilePolys[counterInd+numPolys+1] = (int)(5); // this fixes the issue

It looks like your kernel is crashing on the GPU itself. You can't really get any extra diagnostics about that directly, at least not on macOS. You'll need to start narrowing down the problem. Some suggestions:
As the crash is currently happening in clFinish() you don't know what asynchronous command is causing the crash. Try switching all your enqueue calls to blocking mode. This should cause it to crash in the call that's actually going wrong.
Check return/error codes on all OpenCL API calls. Sometimes, ignoring an error from an earlier call can cause problems in a later call which relies on earlier results. For example, if creating a buffer fails, passing the result of that buffer creation as a kernel argument will cause problems when trying to run the kernel.
The most likely reason for the crash is that your OpenCL kernel is accessing memory out of bounds or is otherwise misusing pointers. Re-check any array index calculations.
Check if the problem occurs with smaller work batches. Scale up from one workgroup (or work item if not using groups) and see if it only occurs beyond a certain work size. This may give you a clue about buffer sizes and array indices that might be causing the crash.
Systematically comment out parts of your kernel. If the crash goes away if you comment out a specific piece of code, there's a good chance the problem is in that code.
If you've narrowed the problem down to a small area of code but can't work out where it's coming from, start recording diagnostic output to check that variables have the values you're expecting.
Without seeing any code, I can't give you any more specific advice than that.
Note that OpenCL is deprecated on macOS, so if you're specifically targeting that platform and don't need to support Linux, Windows, etc. I recommend learning Metal Compute instead. Apple has made it clear that this is the GPU programming platform they want to support, and the tooling for it is already much better than their OpenCL tooling ever was.
I suspect Apple will eventually stop implementing OpenCL support when they release a Mac with a new type of GPU, so even if you're targeting the Mac as well as other platforms, you will probably need to switch to Metal on the Mac somewhere down the line anyway. As of macOS 10.14, the minimum system requirements of the OS already include a Metal-capable GPU, so you only need OpenCL as a fallback if you wish to support all Mac models able to run 10.13 or an even older OS version.

Related

DirectX 11 compute shader for ray/mesh intersect

I recently converted a DirectX 9 application that was using D3DXIntersect to find ray/mesh intersections to DirectX 11. Since D3DXIntersect is not available in DX11, I wrote my own code to find the intersection, which just loops over all the triangles in the mesh and tests them, keeping track of the closest hit to the origin. This is done on the CPU side and works fine for picking via the GUI, but I have another part of the application that creates a new mesh from an existing one based on several different viewpoints, and I need to check line of sight for every triangle in the mesh many times. This gets pretty slow.
Does it make sense to use a DX11 compute shader to do this (i.e. would there be a significant speedup from doing it on the CPU)? I searched the internet but could not find an existing example.
Assuming the answer is yes, here is the approach I am thinking of:
Launch a thread for every triangle in my mesh
Each thread computes the distance to a hit on that triangle, or returns max float on a miss. Store one value per thread in a buffer.
Then do a reduction and return the minimum (non-negative) value.
I wish I had access to something like CUDA Thrust in DirectX, because I think coding up that reduction is going to be a pain. That's why I'm asking, so I don't do a bunch of work for nothing!
This is totally doable, here is some HLSL code that allows to perform that (and also handles the case where you hit 2 triangles with the same distance).
I assume that you know how to create resources (Structured Buffer) and bind them to compute pipeline.
Also I'll consider that your geometry is Indexed.
The first step is to collect triangles that pass the test. Instead of using a "Hit" flag, we will use an Append buffer to only push elements that pass the test.
First create 2 structured buffers (position and triangle indices), and copy your model data onto those.
Then create a Structured Buffer with an Appendable Unordered view.
To perform Hit detection, you can use the following Compute code:
struct rayHit
{
uint triangleID;
float distanceToTriangle;
};
cbuffer cbRaySettings : register(b0)
{
float3 rayFrom;
float3 rayDir;
uint TriangleCount;
};
StructuredBuffer<float3> positionBuffer : register(t0);
StructuredBuffer<uint3> indexBuffer : register(t1);
AppendStructuredBuffer<rayHit> appendRayHitBuffer : register(u0);
void TestTriangle(float3 p1, float3 p2, float3 p3, out bool hit, out float d)
{
//Perform ray/triangle intersection
hit = false;
d = 0.0f;
}
[numthreads(64,1,1)]
void CS_RayAppend(uint3 tid : SV_DispatchThreadID)
{
if (tid.x >= TriangleCount)
return;
uint3 indices = indexBuffer[tid.x];
float3 p1 = positionBuffer[indices.x];
float3 p2 = positionBuffer[indices.y];
float3 p3 = positionBuffer[indices.z];
bool hit;
float d;
TestTriangle(p1,p2,p3,hit, d);
if (hit)
{
rayHit hitData;
hitData.triangleID = tid.x;
hitData.distanceToTriangle = d;
appendRayHitBuffer.Append(hitData);
}
}
Please note that you need to provide a sufficient size for appendRayHitBuffer (worst case scenario is Triangle Count, eg :every triangle is hit by the ray).
Once this is done, the beginning part of the buffer contains hit data, and the unordered view counter the number of triangles that passed the test.
Then you need to create an argument buffer, and a small Byte Address Buffer (size 16, since I don't think runtime will allow 12)
You also need a small structured buffer (one element is enough), which will be used to store minimum distance
Use CopyStructureCount to pass the UnorderedView counter into those buffers (plase note that second and third element of the Argument buffer needs to be both set to 1, as they will be arguments for use dispatch).
Clear the small StructuredBuffer Buffer using UINT_MAXVALUE, and use the Argument buffer with DispatchIndirect
I assume that you will not have many hits, so for next part numthreads will be set to 1,1,1 (if you want to use larger groups, you will need to run another compute shader to build the argument buffer).
Then to find minimum distance:
StructuredBuffer<rayHit> rayHitbuffer : register(t0);
ByteAddressBuffer rayHitCount : register(t1);
RWStructuredBuffer<uint> rwMinBuffer : register(u0);
[numthreads(1,1,1)]
void CS_CalcMin(uint3 tid : SV_DispatchThreadID)
{
uint count = rayHitCount.Load(0);
if (tid.x >= count)
return;
rayHit hit = rayHitbuffer[tid.x];
uint dummy;
InterlockedMin(rwMinBuffer[0],asuint(hit.distanceToTriangle), dummy);
}
Since we expect that hit distance will be greater than zero, we can use asuint and InterlockedMin in that scenario. Also since we use DispatchIndirect, this part is now only applied to the elements that previously passed the test.
Now your single element buffer contains the minimum distance, but not the index( or indices).
Last part, we need to finally extract triangle index that is at the minimum hit distance.
You need again a new StructuredBuffer with an UnorderedView to store the minimum index.
Use the same dispatch arguments as before (indirect), and perform the following:
ByteAddressBuffer rayHitCount : register(t1);
StructuredBuffer<uint> MinDistanceBuffer : register(t2);
AppendStructuredBuffer<uint> appendMinHitIndexBuffer : register(u0);
[numthreads(1,1,1)]
void CS_AppendMin(uint3 tid : SV_DispatchThreadID)
{
uint count = rayHitCount.Load(0);
if (tid.x >= count)
return;
rayHit hit = rayHitbuffer[tid.x];
uint minDist = MinDistanceBuffer[0];
uint d = asuint(hit.distanceToTriangle);
if (d == minDist)
{
appendMinHitIndexBuffer.Append(hit.triangleID);
}
}
Now appendMinHitIndexBuffer contains the triangle index that is the closest (or several if you have that scenario), you can copy it back using a Staging resource and Map your resource for reading.
Actually it makes a lot of sense. Here is also a whitepaper which has some useful shader snippets: http://www.graphicon.ru/html/2012/conference/EN2%20-%20Graphics/gc2012Shumskiy.pdf . Also you can use DirectCompute/CUDA/OpenCL in DirectX, but if I might give you a hint, do it in DirectCompute, because I think it is the least hassle to set up and get it running

Nearest Neighbors in CUDA Particles

Edit 2: Please take a look at this crosspost for TLDR.
Edit: Given that the particles are segmented into grid cells (say 16^3 grid), is it a better idea to let run one work-group for each grid cell and as many work-items in one work-group as there can be maximal number of particles per grid cell?
In that case I could load all particles from neighboring cells into local memory and iterate through them computing some properties. Then I could write specific value into each particle in the current grid cell.
Would this approach be beneficial over running the kernel for all particles and for each iterating over (most of the time the same) neighbors?
Also, what is the ideal ratio of number of particles/number of grid cells?
I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:
Buffer P holding all particles' 3D positions (float3)
Buffer Sp storing int2 pairs of particle ids and their spatial hashes. Sp is sorted according to the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)
Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp. Example: L[12] = (int2)(0, 50).
L[12].x is the index (in Sp) of the first particle with spatial hash 12.
L[12].y is the index (in Sp) of the last particle with spatial hash 12.
Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):
__kernel process_particles(float3* P, int2* Sp, int2* L, int* Out) {
size_t gid = get_global_id(0);
float3 curr_particle = P[gid];
int processed_value = 0;
for(int x=-1; x<=1; x++)
for(int y=-1; y<=1; y++)
for(int z=-1; z<=1; z++) {
float3 neigh_position = curr_particle + (float3)(x,y,z)*GRID_CELL_SIDE;
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
continue;
int neigh_hash = spatial_hash( neigh_position );
int2 particles_range = L[ neigh_hash ];
for(int p=particles_range.x; p<particles_range.y; p++)
processed_value += heavy_computation( P[ Sp[p].y ] );
}
Out[gid] = processed_value;
}
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particulary P[Sp[p].y] in the inner-most for loop) to be causing the slowness.
What I want to do is to use Z-order curve as the spatial hash. That way I could have only 1 for loop iterating through a continuous range of memory when querying neighbors. The only problem is that I don't know what should be the start and stop Z-index values.
The holy grail I want to achieve:
__kernel process_particles(float3* P, int2* Sp, int2* L, int* Out) {
size_t gid = get_global_id(0);
float3 curr_particle = P[gid];
int processed_value = 0;
// How to accomplish this??
// `get_neighbors_range()` returns start and end Z-index values
// representing the start and end near neighbors cells range
int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
int last_particle_id = L[ nearest_neighboring_cells_range.y ].y;
for(int p=first_particle_id; p<=last_particle_id; p++) {
processed_value += heavy_computation( P[ Sp[p].y ] );
}
Out[gid] = processed_value;
}
You should study the Morton Code algorithms closely. Ericsons Real time collision detection explains that very well.
Ericson - Real time Collision detection
Here is another nice explanation including some tests:
Morton encoding/decoding through bit interleaving: Implementations
Z-Order algorithms only defines the paths of the coordinates in which you can hash from 2 or 3D coordinates to just an integer. Although the algorithm goes deeper for every iteration you have to set the limits yourself. Usually the stop index is denoted by a sentinel. Letting the sentinel stop will tell you at which level the particle is placed. So the maximum level you want to define will tell you the number of cells per dimension. For example with maximum level at 6 you have 2^6 = 64. You will have 64x64x64 cells in your system (3D). That also means that you have to use integer based coordinates. If you use floats you have to convert like coord.x = 64*float_x and so on.
If you know how many cells you have in your system you can define your limits. Are you trying to use a binary octree?
Since particles are in motion (in that CUDA example) you should try to parallelize over the number of particles instead of cells.
If you want to build lists of nearest neighbours you have to map the particles to cells. This is done through a table that is sorted afterwards by cells to particles. Still you should iterate through the particles and access its neighbours.
About your code:
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particulary P[Sp[p].y] in the inner-most for loop) to be causing the slowness.
Remember Donald Knuth. You should measure where the bottle neck is. You can use NVCC Profiler and look for bottleneck. Not sure what OpenCL has as profiler.
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
continue;
I think you should not branch it that way, how about returning zero when you call heavy_computation. Not sure, but maybe you have sort of a branch prediction here. Try to remove that somehow.
Running parallel over the cells is a good idea only if you have no write accesses to the particle data, otherwise you will have to use atomics. If you go over the particle range instead you read accesses to the cells and neighbours but you create your sum in parallel and you are not forced to some race condiction paradigm.
Also, what is the ideal ratio of number of particles/number of grid cells?
Really depends on your algorithms and the particle packing within your domain, but in your case I would define the cell size equivalent to the particle diameter and just use the number of cells you get.
So if you want to use Z-order and achieve your holy grail, try to use integer coordinates and hash them.
Also try to use larger amounts of particles. About 65000 particles like CUDA examples uses you should consider because that way the parallelisation is mostly efficient; the running processing units are exploited (fewer idles threads).

How to convert an image to a Box2D polygon using the alpha layer and triangulation?

I'm coding a game using Box2D and SFML, and I'd like to let my users import their own textures to use as physics polygons. The polygons are created using the images' alpha layer. It doesn't need to be pixel perfect, and this is where my problem is. If it's pixel-perfect, it's going to be way too buggy when the player gets stuck between two rather complex shapes. I have a working edge-detection algorithm, and it produces something like this. It's pixel per pixel (and the shape it's tracing is simply a square with an dip). After that, I have a simplifying algorithm that produces this. It works fine to me, but if every single corner is traced like that, I'm going to have some problems. The code for the vector-simplifying is this:
//borders is a std::vector containing simple Box2D b2Vec2 (2D vector class containing an x and a y)
//vector shortener
for(unsigned int i = 0; i < borders.size(); i++)
{
int x = 0, y = 0;
int counter = 0;
//get the values for x and y that need to be added to check whether in a line or not
x = borders[i].x - borders[i-1].x;
y = borders[i].y - borders[i-1].y;
//while points are aligned..
while((borders[i].x + x*counter == borders[i + counter].x) && (borders[i].y + y*counter == borders[i+counter].y))
{
counter++;
}
if(counter-1 > i)
{
borders.erase(borders.begin() + i, borders.begin() + i + counter -1);
}
}
So my question is, how can I transform the previous set of vectors into something a bit less precise? Are there any rounding algorithms out there? If so, which is best? Any tips you can give me? It doesn't matter whether the resulting polygon is convex or concave, I'm triangulating it anyways.
Thanks,
AsterAlff

OpenCL for-loop doing strange things

I'm currently implementing terrain generation in OpenCL using layered octaves of noise and I've stumbled upon this problem:
float multinoise2d(float2 position, float scale, int octaves, float persistence)
{
float result = 0.0f;
float sample = 0.0f;
float coefficient = 1.0f;
for(int i = 0; i < octaves; i++){
// get a sample of a simple signed perlin noise
sample = sgnoise2d(position/scale);
if(i > 0){
// Here is the problem:
// Implementation A, this works correctly.
coefficient = pown(persistence, i);
// Implementation B, using this only the first
// noise octave is visible in the terrain.
coefficient = persistence;
persistence = persistence*persistence;
}
result += coefficient * sample;
scale /= 2.0f;
}
return result;
}
Does OpenCL parallelize for-loops, leading to synchronization issues here or am I missing something else?
Any help is appreciated!
the problem of your code is with the lines
coefficient = persistence;
persistence = persistence*persistence;
It should be changed to
coefficient = coefficient *persistence;
otherwise on every iteration
the first coeficient grows by just persistence
pow(persistence, 1) ; pow(persistence, 2); pow(persistence, 3) ....
However the second implementation goes
pow(persistence, 1); pow(persistence, 2); pow(persistence, 4); pow(persistence, 8) ......
soon "persistence" will run above the limit for float and you will get zeros (or undefined behavior) in your answer.
EDIT
Two more things
Accumulation (implementation 2) is not a good idea, specially with real numbers and with algorithms that require accuracy. You might be losing a small fraction of you information every time you accumulate on "persistence" (e.g due to rounding). Prefer direct calculation (1st implementation) over accumulation whenever you can. (plus if this was Serial the 2nd implementation will be readily parallelizable.)
If you are working with AMD OpenCL pay attention to the pow() functions. I have had problems with those on multiple machines on multiple occasions. The functions seem to hang sometimes for no reason. Just FYI.
I'm assuming this is some kind of utility method that is called in your CL kernel. Vivek is correct in his comment above: OpenCL does not parallelize your code for you. You have to leverage OpenCL's facilities for dividing your problem into data-parallel chunks.
Also, I don't see a potential synchronization issue in the above code. All of your variables are in work-item private memory space.

Loop through all pixels and get/set individual pixel color in OpenGL?

I wrote a little thingy with Processing that I now would like to make a Mac OS X Screen Saver of. However, diving in to OpenGL was not as easy as I thought it would be.
Basically I want to loop through all pixels on screen and based on that pixels color set another pixels color.
The Processing code looks like this:
void setup(){
size(500,500, P2D);
frameRate(30);
background(255);
}
void draw(){
for(int x = 0; x<width; x++){
for(int y = 0; y<height; y++){
float xRand2 = x+random(2);
float yRand2 = y+random(2);
int xRand = int(xRand2);
int yRand = int(yRand2);
if(get(x,y) == -16777216){
set(x+xRand, y+yRand, #FFFFFF);
}
else if(get(x,y) == -1){
set(x+xRand, y+yRand, #000000);
}
}
}
}
It's not very pretty and nor is it very effective. However, I'd like to find out how to do something similiar with OpenGL. I don't even know where to start.
The basic idea of OpenGL is that you never set the values of individual pixels manually, because that's often too slow. Instead you render triangles and do all kinds of tricks with them, like textures, blending, etc.
In order to freely program what each individual pixel does in OpenGL, you need to use a technique called shaders. And that's not very easy if you haven't done anything similar before. The idea of shaders is that GPU executes them instead of CPU, which results in very good performance and takes the load off from the CPU. But in your case it is probably a better idea to do it with CPU and not with shaders and OpenGL, as that approach is much easier to start with.
I recommend you use a library like SDL (or possibly glfw), which lets you do things with pixels without hardware acceleration. You can still do it with OpenGL too, though. By using the function glDrawPixels. That function draws raw pixel data to the screen. But it's probably not very fast.
So start by reading some tutorials about SDL, for example.
Edit: If you want to use shaders, the difficulty with them (among other things) is that you can't specify coordinates to which set pixel values. And you can't get pixel values directly from the screen either. One way to do it with shaders would be the following:
Set up two textures: texture A and texture B
Bind one of the textures as a target that you render everything to
Bind another one of the textures as an input texture for the shader
Render a full-screen quadrangle using your shader and show the result on the screen
Swap textures A and B so you became to use your previous result as your next input
Render again
If you don't want to go the shader route, try doing all your pixel modification in the CPU on a 2D memory array, then use glDrawPixels every frame to push your pixels to the screen. It won't be very hardware accelerated, but it might be fine for your purposes. Another thing to try is to use glTexImage2D to bind your new pixel data every frame to a texture and then render a textured quad to the entire screen. I'm not sure which will be faster. My advice is to try these things before jumping into the complexity of shaders.
There are a few bugs in your code that make reverse engineering and porting it harder, and make me wonder if you actually posted the correct code. Assuming that the visual effect produced is what you want, here is a more efficient and more correct draw():
void draw() {
loadPixels();
for(int x = 0; x<width/2; x++) {
for(int y = 0; y<height/2; y++) {
int x_new = 2*x+int(random(2));
int y_new = 2*y+int(random(2));
if (x_new < width && y_new < height) {
int dest_pixel = (y_new*width + x_new);
color c = pixels[y*width+x];
if(c == #FFFFFF){
pixels[dest_pixel] = #000000;
}
else {
pixels[dest_pixel] = #FFFFFF;
}
}
}
}
updatePixels();
}
Note that the upper bounds of the loop are divided by two. As you wrote it, 3/4 of your set() calls were for pixels that are beyond the bounds of the window. The extra if is necessary because of the addition of small random values to the coordinates.
The overall effect of this code could be described as an in-place stretch and invert of the image, with a little bit of randomness thrown in. Because it's an in-place transformation, it can't be easily parallelized or accelerated, so you are best off implementing this as bitmap/texture operations on the CPU. You can do this without having to ever read pixels from the GPU, but you will have to push a screen full of pixels to the GPU each frame.
If you use glDrawPixels with a format argument of GL_LUMINANCE and a type argument of GL_UNSIGNED_BYTE, then you can pretty easily convert this code to operate on a byte array, which will keep the memory consumption down somewhat as compared with using 32-bit RGBA values.

Resources