Dispatching more than 65535 threads - parallel-processing

I'm attempting to skin vertices using DirectCompute. The skinning method allows a variable number of weights to influence each vertex (e.g. MD5 meshes are defined this way).
The inputs to the compute shader are:
JointsBuffer { float4 orientation, float4 position } Structured buffer SRV
WeightsBuffer { float3 normal, float4 position, float bias, uint jointIndex } Structured buffer SRV
VerticesBuffer { float2 texcoords, uint weightIndex, uint numWeights } Structured buffer SRV
and the output is
SkinnedVerticesBuffer { float3 normal, float4 position, float2 texcoord } Structured buffer UAV
Now the compute shader should be run once per element in the vertex buffer, and using SV_DispatchThreadID the shader attempts to populate the corresponding SkinnedVertex in the SkinnedVerticesBuffer for every Vertex in the VerticesBuffer ( 1:1 correspondence ).
The problem is that many meshes have more than 65535 vertices, and the Dispatch call only allows for dispatching that many threads per dimension. Now I can theoretically write something that splits many counts into a product of three factors, each less than 65535, but I can't possibly do that for prime numbers.
So, for example, when a mesh with 71993 vertices (a prime number) comes up, I can't think of a way to handle it.
I can't over-dispatch, say, 72000 threads with context->Dispatch( 36000, 2, 1 ), because then SV_DispatchThreadID will run past my buffer bounds.
Right now I'm leaning towards a constant buffer holding the number of vertices, then over-dispatching to the next power of 2 and simply doing
if( SV_DispatchThreadID >= numVertices ) return;
Is this my only option? Has anyone else run into this snag?

I never have. But 65000 threads seems like an awful lot.
Then, when I look at the documentation, it seems that the values you pass are not threads but thread groups. Someone on gamedev seems to have performance issues when passing a number as large as 768, so it seems to me that you will have to decrease that huge number.
I'm not sure, but I get the feeling you're misinterpreting these parameters. Try reading again what these values actually mean. (Just a layman's gut feeling, though.)
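To make the thread-group reading concrete, here is a small arithmetic sketch. It assumes the shader is declared with [numthreads(64, 1, 1)] (the group size is my assumption, not something stated in the question), and group_count is a hypothetical helper name. Each group then supplies 64 threads, so the value passed to Dispatch is the vertex count divided by 64, rounded up.

```python
def group_count(item_count: int, threads_per_group: int) -> int:
    """Ceiling division: smallest number of groups that covers item_count threads."""
    return (item_count + threads_per_group - 1) // threads_per_group


NUM_VERTICES = 71993      # the prime vertex count from the question
THREADS_PER_GROUP = 64    # assumed to match [numthreads(64, 1, 1)] in the shader

groups = group_count(NUM_VERTICES, THREADS_PER_GROUP)
launched = groups * THREADS_PER_GROUP

# The host side would then call something like context->Dispatch(groups, 1, 1).
print("groups:", groups)
print("threads launched:", launched)
print("surplus threads that must exit early:", launched - NUM_VERTICES)
```

The few surplus threads are exactly what the bounds check against the vertex count in the constant buffer is for. Under the same assumption, 65535 groups in one dimension would already cover 65535 * 64 = 4,194,240 vertices.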

How can I scale/interpolate an image with indexed values smoothly?

I want to smoothly upscale grayscale images (input masks, really) that hold discrete values. The values in these images are indexes that represent arbitrary concepts (e.g. "terrain types"; they are usually indices into a table), rather than values on a continuous scale, so they can't be averaged or blended in any way.
Do there exist algorithms that can do this with a more pleasing result than nearest-neighbour, which results in a very blocky, pixelated result? I am looking for something that will at least produce more rounded, more fluid results. The kind of thing that would be ideal would be a whitepaper, or a library (preferably in Java).
I've researched the subject, but I can't find anything. There is plenty about linear or cubic interpolation, etc., but that won't work for indexed values. The only algorithm I ever see mentioned that does not try to average values is nearest-neighbour. But there must be more?
I'm using colour here for clarity. I do of course understand that the preferred result here is impossible; I'm not asking for something that reconstitutes destroyed information, just hoping for something that will at least guesstimate something smoother than the first result.
Scan the destination image and for every corresponding source pixel (non-integer coordinates) check if the colors of the four surrounding pixels are the same. If yes, assign that color.
If not, perform as many bilinear interpolations as there are different colors. For this assign the weight 1 for a given color (each in turn) and 0 for the others, and interpolate the weight. Finally, keep the color with the largest weight.
By analytical geometry, one can show that in bilinear interpolation the iso-weight curves are arcs of hyperbolas. If your magnification is large, you will see them. G1 continuity is not guaranteed. If this is an annoyance, you can work with G1 bicubic interpolation instead.
If this still does not satisfy you, you can try smooth approximating surfaces rather than interpolating ones. But the principle of keeping the color of maximum weight remains.
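The bilinear variant described above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the image is a list of rows of integer labels, the scale factor is an integer, pixel centres are aligned in the usual way, and upscale_labels is a hypothetical name. Ties between labels are broken arbitrarily.

```python
def upscale_labels(src, scale):
    """Upscale a 2D grid of integer labels by an integer factor.

    For each destination pixel, every label found among the four surrounding
    source pixels accumulates its bilinear weight; the label with the largest
    total weight is kept, so no new (blended) values are ever produced.
    """
    h, w = len(src), len(src[0])
    out = []
    for dy in range(h * scale):
        # Map the destination pixel centre back into source coordinates.
        sy = min(max((dy + 0.5) / scale - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for dx in range(w * scale):
            sx = min(max((dx + 0.5) / scale - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            weights = {}
            for label, weight in (
                (src[y0][x0], (1 - fx) * (1 - fy)),
                (src[y0][x1], fx * (1 - fy)),
                (src[y1][x0], (1 - fx) * fy),
                (src[y1][x1], fx * fy),
            ):
                weights[label] = weights.get(label, 0.0) + weight
            row.append(max(weights, key=weights.get))
        out.append(row)
    return out


small = [[1, 1],
         [1, 2]]
for line in upscale_labels(small, 4):
    print("".join(str(v) for v in line))
```

When all four neighbours carry the same label, that label collects the full weight, which covers the shortcut from the first step without a special case.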
If there aren't many distinct colors and you want to use ready-made functions, you can work this out as follows:
split the image in several binary images (white for a chosen color, black for background);
magnify all images (to grayscale) using your favorite method;
then write a function that assigns every pixel the color that has the largest value among the magnified images.
You can also apply a smoothing filter to the binary images before or after magnification.
For the sake of illustration, here is what you would get with two colors at a time (but this easily generalizes).
Color source image:
Smoothing applied to the binary equivalents:
Magnified:
Maximum weight decision:
One thing you could try is to extract a polygon for the boundary of each uniformly-colored region, then upscale and draw the polygon in the output image. You won’t create neatly rounded edges, but you will avoid the stair-case effect of the nearest neighbor interpolation. Upscaling polygons should avoid gaps between the regions too.
I guess that smoothing the shape for each value individually is a way to avoid undesired mixed values.
To handle the values individually, I started with your nearest-neighbour image and created 3 images { A.bmp, B.bmp, C.bmp } by hand.
(Each image has only 1 color region and the background is black, e.g. A.bmp is below:)
After smoothing the shape in each image, draw these shapes into one result image buffer with different colors.
// I use C++ and OpenCV
#include <string>
#include <opencv2/opencv.hpp>

int main()
{
    const std::string FileNames[3] = { "A.bmp", "B.bmp", "C.bmp" };
    const cv::Scalar ResultShowColor[3] = { cv::Scalar(0,255,255), cv::Scalar(0,255,0), cv::Scalar(0,0,255) };
    cv::Mat Imgs[3];
    const int KernelSize = 15;
    for( int i=0; i<3; ++i )
    {
        Imgs[i] = cv::imread( FileNames[i], cv::IMREAD_GRAYSCALE );
        if( Imgs[i].empty() )return 0;
        // Binarize, blur the mask, then binarize again at 50% to get a smoothed shape
        cv::threshold( Imgs[i], Imgs[i], 32, 255, cv::THRESH_BINARY );
        cv::GaussianBlur( Imgs[i], Imgs[i], cv::Size(KernelSize,KernelSize), 0 );
        cv::threshold( Imgs[i], Imgs[i], 255*0.5, 255, cv::THRESH_BINARY );
        cv::imshow( FileNames[i], Imgs[i] );
    }
    // Paint each smoothed mask into the result with its own color
    cv::Mat ResultImg = cv::Mat::zeros( Imgs[0].size(), CV_8UC3 );
    for( int i=0; i<3; ++i )
    {
        ResultImg.setTo( ResultShowColor[i], Imgs[i] );
    }
    cv::imshow( "ResultImg", ResultImg );
    if( cv::waitKey() == 's' ){ cv::imwrite( "ResultImg.png", ResultImg ); }
    return 0;
}
This is the result:
Yes, this result is not good enough. Gaps exist at the boundaries of the shapes.
Therefore some ingenuity is still required, but I'm posting this because it might give you a hint.

depth peeling invariance in webgl (and threejs)

I'm looking at what I think is the first paper on depth peeling (the simplest algorithm?) and I want to implement it with WebGL, using three.js.
I think I understand the concept and was able to make several peels, with some logic that looks like this:
render(scene, camera) {
  const oldAutoClear = this._renderer.autoClear
  this._renderer.autoClear = false
  setDepthPeelActive(true) //sets a global injected uniform in a singleton elsewhere, every material in the scene has onBeforeRender injected with additional logic and uniforms
  let ping
  let pong
  for (let i = 0; i < this._numPasses; i++) {
    const pingPong = i % 2 === 0
    ping = pingPong ? 1 : 0
    pong = pingPong ? 0 : 1
    const writeRGBA = this._screenRGBA[i]
    const writeDepth = this._screenDepth[ping]
    setDepthPeelPassNumber(i) //was going to try increasing the polygonOffsetUnits here globally,
    if (i > 0) {
      //all but first pass write to depth
      const readDepth = this._screenDepth[pong]
      setDepthPeelFirstPass(false)
      setDepthPeelPrevDepthTexture(readDepth)
      this._depthMaterial.uniforms.uFirstPass.value = 0
      this._depthMaterial.uniforms.uPrevDepthTex.value = readDepth
    } else {
      //first pass just renders to depth
      setDepthPeelFirstPass(true)
      setDepthPeelPrevDepthTexture(null)
      this._depthMaterial.uniforms.uFirstPass.value = 1
      this._depthMaterial.uniforms.uPrevDepthTex.value = null
    }
    scene.overrideMaterial = this._depthMaterial
    this._renderer.render(scene, camera, writeDepth, true)
    scene.overrideMaterial = null
    this._renderer.render(scene, camera, writeRGBA, true)
  }
  this._quad.material = this._blitMaterial
  // this._blitMaterial.uniforms.uTexture.value = this._screenDepth[ping]
  this._blitMaterial.uniforms.uTexture.value = this._screenRGBA[
    this._currentBlitTex
  ]
  console.log(this._currentBlitTex)
  this._renderer.render(this._scene, this._camera)
  this._renderer.autoClear = oldAutoClear
}
I'm using gl_FragCoord.z to do the test, and packing the depth into an 8-bit RGBA texture, with a shader that looks like this:
float depth = gl_FragCoord.z;
vec4 pp = packDepthToRGBA( depth );
if( uFirstPass == 0 ){
    float prevDepth = unpackRGBAToDepth( texture2D( uPrevDepthTex , vSS));
    if( depth <= prevDepth + 0.0001) {
        discard;
    }
}
gl_FragColor = pp;
Varying vSS is computed in the vertex shader, after the projection:
vSS.xy = gl_Position.xy * .5 + .5;
The basic idea seems to work and I get peels, but only if I use the fudge factor. It looks like it fails, though, as the angle gets more obtuse (which is why polygonOffset needs both the factor and units, to account for the slope?).
I didn't understand at all how the invariance is solved. I don't understand how the mentioned extension is being used, other than that it seems to be overriding the fragment depth, but with what?
I must admit that I'm not even sure which interpolation is being referred to here, since every pixel is aligned and I'm just using nearest filtering.
I did see some hints about depth buffer precision but, not really understanding the issue, I wanted to try packing the depth into only three channels and see what happens.
The fact that such a small fudge factor makes it sort of work tells me that all these sampled and computed depths likely do exist in the same space. But this seems to be the same issue as using gl.EQUAL for depth testing? Just for kicks, I tried to override the depth with the unpacked depth immediately after packing it, but it didn't seem to do anything.
Edit
Increasing the polygon offset with each peel seems to have done the trick. I got some fighting with the lines, though, but I think that's because I was already using an offset to draw them and I need to include that in the peel offset. I'd still love to understand more about the problem.
The depth buffer stores depths :) Depending on the near and far planes, the perspective projection tends to pack the depths of most points into just a short part of the buffer's range. It's not linear in z. You can see this for yourself by setting a different color depending on the depth and rendering a triangle that spans most of the near-far distance.
A shadow map stores depths (distances to the light), calculated after projection. Later, in the second or following pass, you compare those depths, which are "stacked" together; some comparisons fail because the values are so similar. These are the troublesome variances.
You can use a more fine-grained depth buffer, 24 bits instead of 16 or 8 bits. This may solve part of the problem.
There's another issue: the perspective division, z/w, needed to get normalized device coordinates (NDC). It occurs after the vertex shader, so gl_FragDepth = gl_FragCoord.z is affected.
The other approach is to store depths calculated in a space that suffers from neither "stacking" nor perspective division. Camera space is one. In other words, you can calculate the depth in the vertex shader, before the projection is applied.
The article you link to is for the old fixed-function pipeline, without shaders. It shows an NVIDIA extension to deal with these variances.
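To put numbers on the "stacking", here is a small sketch. It assumes a standard OpenGL-style perspective projection and the default depth range of [0, 1]; the near and far values of 0.1 and 100 are arbitrary examples, and window_depth and eye_distance are hypothetical helper names.

```python
def window_depth(distance, near, far):
    """Depth-buffer value in [0, 1] for a point `distance` units in front of
    the camera, for an OpenGL-style perspective projection."""
    z_ndc = (far + near) / (far - near) - (2.0 * far * near) / ((far - near) * distance)
    return 0.5 * z_ndc + 0.5


def eye_distance(depth, near, far):
    """Inverse mapping: recover the linear camera-space distance from a depth value."""
    z_ndc = 2.0 * depth - 1.0
    return (2.0 * far * near) / (far + near - z_ndc * (far - near))


NEAR, FAR = 0.1, 100.0  # arbitrary example clip planes

for distance in (0.1, 1.0, 10.0, 50.0, 100.0):
    print(f"distance {distance:6.1f} -> depth {window_depth(distance, NEAR, FAR):.6f}")
```

With these planes, a point only 1 unit from the camera already lands above 0.9, so everything between 1 and 100 units shares less than a tenth of the value range. That is why two nearby surfaces far from the camera can compare as almost equal. Comparing camera-space distances instead spreads the values evenly.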

DirectX 11 compute shader for ray/mesh intersect

I recently converted a DirectX 9 application that was using D3DXIntersect to find ray/mesh intersections to DirectX 11. Since D3DXIntersect is not available in DX11, I wrote my own code to find the intersection, which just loops over all the triangles in the mesh and tests them, keeping track of the closest hit to the origin. This is done on the CPU side and works fine for picking via the GUI, but I have another part of the application that creates a new mesh from an existing one based on several different viewpoints, and I need to check line of sight for every triangle in the mesh many times. This gets pretty slow.
Does it make sense to use a DX11 compute shader to do this (i.e. would there be a significant speedup over doing it on the CPU)? I searched the internet but could not find an existing example.
Assuming the answer is yes, here is the approach I am thinking of:
Launch a thread for every triangle in my mesh
Each thread computes the distance to a hit on that triangle, or returns max float on a miss. Store one value per thread in a buffer.
Then do a reduction and return the minimum (non-negative) value.
I wish I had access to something like CUDA Thrust in DirectX, because I think coding up that reduction is going to be a pain. That's why I'm asking, so I don't do a bunch of work for nothing!
This is totally doable. Here is some HLSL code that does it (and also handles the case where you hit 2 triangles at the same distance).
I assume that you know how to create resources (structured buffers) and bind them to the compute pipeline.
I'll also assume that your geometry is indexed.
The first step is to collect the triangles that pass the test. Instead of using a "hit" flag, we will use an append buffer to push only the elements that pass the test.
First create 2 structured buffers (positions and triangle indices), and copy your model data into them.
Then create a structured buffer with an appendable unordered access view.
To perform hit detection, you can use the following compute code:
struct rayHit
{
    uint triangleID;
    float distanceToTriangle;
};

cbuffer cbRaySettings : register(b0)
{
    float3 rayFrom;
    float3 rayDir;
    uint TriangleCount;
};

StructuredBuffer<float3> positionBuffer : register(t0);
StructuredBuffer<uint3> indexBuffer : register(t1);
AppendStructuredBuffer<rayHit> appendRayHitBuffer : register(u0);

void TestTriangle(float3 p1, float3 p2, float3 p3, out bool hit, out float d)
{
    //Perform ray/triangle intersection
    hit = false;
    d = 0.0f;
}

[numthreads(64,1,1)]
void CS_RayAppend(uint3 tid : SV_DispatchThreadID)
{
    if (tid.x >= TriangleCount)
        return;
    uint3 indices = indexBuffer[tid.x];
    float3 p1 = positionBuffer[indices.x];
    float3 p2 = positionBuffer[indices.y];
    float3 p3 = positionBuffer[indices.z];
    bool hit;
    float d;
    TestTriangle(p1,p2,p3,hit, d);
    if (hit)
    {
        rayHit hitData;
        hitData.triangleID = tid.x;
        hitData.distanceToTriangle = d;
        appendRayHitBuffer.Append(hitData);
    }
}
Please note that you need to provide a sufficient size for appendRayHitBuffer (the worst case is TriangleCount elements, i.e. every triangle is hit by the ray).
Once this is done, the beginning of the buffer contains the hit data, and the unordered access view's counter holds the number of triangles that passed the test.
Then you need to create an argument buffer and a small byte address buffer (size 16, since I don't think the runtime will allow 12).
You also need a small structured buffer (one element is enough), which will be used to store the minimum distance.
Use CopyStructureCount to copy the unordered access view's counter into those buffers (please note that the second and third elements of the argument buffer must both be set to 1, as they will be the Y and Z arguments of the dispatch).
Clear the small structured buffer to the maximum uint value (0xFFFFFFFF), and use the argument buffer with DispatchIndirect.
I assume that you will not have many hits, so for the next part numthreads will be set to 1,1,1 (if you want to use larger groups, you will need to run another compute shader to build the argument buffer).
Then to find minimum distance:
StructuredBuffer<rayHit> rayHitbuffer : register(t0);
ByteAddressBuffer rayHitCount : register(t1);
RWStructuredBuffer<uint> rwMinBuffer : register(u0);

[numthreads(1,1,1)]
void CS_CalcMin(uint3 tid : SV_DispatchThreadID)
{
    uint count = rayHitCount.Load(0);
    if (tid.x >= count)
        return;
    rayHit hit = rayHitbuffer[tid.x];
    uint dummy;
    InterlockedMin(rwMinBuffer[0],asuint(hit.distanceToTriangle), dummy);
}
Since we expect hit distances to be non-negative, we can use asuint and InterlockedMin here: for non-negative IEEE 754 floats, comparing the bit patterns as unsigned integers gives the same order as comparing the float values. Also, since we use DispatchIndirect, this pass only runs on the elements that previously passed the test.
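The asuint trick can be checked on the CPU. The sketch below reinterprets 32-bit float bits as unsigned integers with the standard struct module; as_uint and as_float are hypothetical stand-ins for the HLSL intrinsics, and the sample distances are made up.

```python
import struct


def as_uint(value: float) -> int:
    """Reinterpret the bits of a 32-bit float as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def as_float(bits: int) -> float:
    """Reinterpret an unsigned 32-bit integer as a 32-bit float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


UINT_MAX = 0xFFFFFFFF                      # value the minimum buffer is cleared to
distances = [3.5, 0.25, 17.0, 0.25, 1.5]   # made-up, non-negative hit distances

# Emulates what repeated InterlockedMin calls leave in the one-element buffer.
min_bits = UINT_MAX
for d in distances:
    min_bits = min(min_bits, as_uint(d))

print(hex(min_bits), as_float(min_bits))

# The same comparison breaks for negative values: the sign bit makes them "large".
print(as_uint(-1.0) > as_uint(1.0))
```

This is also why a test that can return negative distances needs to filter them out before the minimum pass.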
Now your single-element buffer contains the minimum distance, but not the index (or indices).
For the last part, we need to extract the index of the triangle that is at the minimum hit distance.
Again you need a new structured buffer with an unordered access view to store the minimum index.
Use the same dispatch arguments as before (indirect), and perform the following:
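// Same rayHit struct and hit buffer as in the previous passes
struct rayHit
{
    uint triangleID;
    float distanceToTriangle;
};

StructuredBuffer<rayHit> rayHitbuffer : register(t0);
ByteAddressBuffer rayHitCount : register(t1);
StructuredBuffer<uint> MinDistanceBuffer : register(t2);
AppendStructuredBuffer<uint> appendMinHitIndexBuffer : register(u0);

[numthreads(1,1,1)]
void CS_AppendMin(uint3 tid : SV_DispatchThreadID)
{
    uint count = rayHitCount.Load(0);
    if (tid.x >= count)
        return;
    rayHit hit = rayHitbuffer[tid.x];
    uint minDist = MinDistanceBuffer[0];
    uint d = asuint(hit.distanceToTriangle);
    // Keep every triangle whose distance matches the minimum exactly
    if (d == minDist)
    {
        appendMinHitIndexBuffer.Append(hit.triangleID);
    }
}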
Now appendMinHitIndexBuffer contains the index of the closest triangle (or several, if you have that scenario). You can copy it back using a staging resource and Map that resource for reading.
Actually, it makes a lot of sense. Here is also a whitepaper which has some useful shader snippets: http://www.graphicon.ru/html/2012/conference/EN2%20-%20Graphics/gc2012Shumskiy.pdf . You can also use DirectCompute, CUDA, or OpenCL alongside DirectX, but if I might give you a hint, do it in DirectCompute, because I think it is the least hassle to set up and get running.

How to improve texture access performance in OpenGL shaders?

Conditions
I use OpenGL 3 and PyOpenGL.
I have ~50 thousand (53'490) vertices and each of them has 199 vec3 attributes which determine its displacement. It's impossible to store this data as regular vertex attributes, so I use a texture.
The problem is that a non-parallelized C function calculates the displacement of the vertices as fast as GLSL, and even faster in some cases. I've checked: the issue is the texture reads, and I don't understand how to optimize them.
I've written two different shaders. One calculates the new model in ~0.09s and the other in ~0.12s (including attribute assignment, which is the same in both cases).
Code
Both shaders start with
#version 300 es
in vec3 vin_position;
out vec4 vin_pos;
uniform mat4 rotation_matrix;
uniform float coefficients[199];
uniform sampler2D principal_components;
The faster one is
void main(void) {
    int c_pos = gl_VertexID;
    int texture_size = 8192;
    ivec2 texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
    vec4 tmp = vec4(0.0);
    for (int i = 0; i < 199; i++) {
        tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
        c_pos += 53490;
        texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
    }
    gl_Position = rotation_matrix
        * vec4(vin_position + tmp.xyz, 246006.0);
    vin_pos = gl_Position;
}
The slower one
void main(void) {
    int texture_size = 8192;
    // Usable columns per row: the largest multiple of 199 that fits in the texture width
    int columns = texture_size - texture_size % 199;
    int c_pos = gl_VertexID * 199;
    ivec2 texPos = ivec2(c_pos % columns, c_pos / columns);
    vec4 tmp = vec4(0.0);
    for (int i = 0; i < 199; i++) {
        tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
        texPos.x++;
    }
    gl_Position = rotation_matrix
        * vec4(vin_position + tmp.xyz, 246006.0);
    vin_pos = gl_Position;
}
The main difference between them:
In the first case, the attributes of the vertices are stored in the following way:
    first attributes of all vertices
    second attributes of all vertices
    ...
    last attributes of all vertices
In the second case, the attributes of the vertices are stored the other way round:
    all attributes of the first vertex
    all attributes of the second vertex
    ...
    all attributes of the last vertex
Also, in the second example the data is aligned so that all attributes of each vertex are stored in a single row. This means that if I know the row and column of the first attribute of a vertex, I only need to increment the x component of the texture coordinate.
I thought that aligned data would be accessed faster.
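The two layouts can be compared with a little index arithmetic. The sketch below mirrors the texel addressing of both shaders on the CPU; the function names are hypothetical, and the constants are the ones from the shaders.

```python
TEXTURE_SIZE = 8192
NUM_VERTICES = 53490
NUM_COMPONENTS = 199

# Second layout: usable columns per row (largest multiple of 199 that fits).
COLUMNS = TEXTURE_SIZE - TEXTURE_SIZE % NUM_COMPONENTS


def texel_first_layout(vertex, component):
    """Component i of all vertices is stored contiguously (faster shader)."""
    c_pos = vertex + component * NUM_VERTICES
    return (c_pos % TEXTURE_SIZE, c_pos // TEXTURE_SIZE)


def texel_second_layout(vertex, component):
    """All components of one vertex are stored contiguously (slower shader)."""
    c_pos = vertex * NUM_COMPONENTS
    return (c_pos % COLUMNS + component, c_pos // COLUMNS)


print("usable columns:", COLUMNS, "vertices per row:", COLUMNS // NUM_COMPONENTS)

# Texels read by two consecutive vertices during the same loop iteration (i = 5).
for name, layout in (("first", texel_first_layout), ("second", texel_second_layout)):
    print(name, layout(100, 5), layout(101, 5))
```

In the first layout, vertices 100 and 101 read neighbouring texels in the same iteration, while in the second layout their reads are 199 texels apart. If consecutive vertices are processed side by side, this may be one reason the first shader is faster, although that depends on how the hardware schedules vertices and lays out the texture.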
Questions
Why is data not accessed faster?
How can I increase performance of it?
Is there a way to link a texture chunk with a vertex?
Are there recommendations for data alignment, or a good related article about caching in GPUs (Intel HD, nVidia GeForce)?
Notes
The coefficients array changes from frame to frame; otherwise there would be no problem: I could precalculate the model and be happy.
Why is data not accessed faster?
Because GPUs are not magical. GPUs gain performance by performing calculations in parallel. Performing over 10 million texel fetches (53,490 vertices times 199 fetches each), no matter how it happens, is not going to be fast.
If you were using the results of those fetches to do lighting computations, it would appear fast because the latency of the memory fetches would be hidden behind the lighting computation. You are taking the result of a fetch, doing a multiply/add, then doing another fetch. That's slow.
Is there a way to link a texture chunk with a vertex?
Even if there were (and there isn't), how would that help? GPUs execute operations in parallel. That means multiple vertices are being processed simultaneously, each performing 199 texture fetches.
So what would aid performance there is making each texture access coherent. That is, neighboring vertices would access neighboring texels, thus making the texture fetches more cache efficient. But there's no way to know what vertices will be considered "neighbors". And texture swizzle layouts are implementation dependent, so even if you did know the order of vertex processing, you couldn't adjust your texture to take local advantage of it.
The best way to do that would be to ditch vertex shaders and texture accesses in favor of compute shaders and SSBOs. That way, you have direct knowledge of the locality of your accesses, by setting the work group size. With SSBOs, you can arrange your array in whatever fashion gives you the best locality of access for each wavefront.
But things like this are the equivalent of putting band-aids on a gaping wound.
How can I increase performance of it?
Stop doing so many texture fetches.
I'm being completely serious. While there are ways to mitigate the costs of what you're doing, the most effective solution is to change your algorithm so that it doesn't need to do that much work.
Your algorithm looks suspiciously like vertex morphing via a palette of "poses", with the coefficient specifying the weight applied to each pose. If that's the case, then odds are good that most of your coefficients are either 0 or negligibly small. If so, then you're wasting vast amounts of time accessing textures only to transform their contributions into nothing.
If most of your coefficients are 0, then the best thing to do would be to pick some arbitrary and small number for the maximum number of coefficients that can affect the result. For example, 8. You send an array of 8 indices and coefficients to the shader as uniforms. Then you walk that array, fetching only 8 times. And you might be able to get away with just 4.
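On the CPU side, picking the strongest few coefficients each frame could look like the sketch below. The function name, the choice of k, and the sample weights are all made up for illustration; the resulting pairs would be uploaded as two small uniform arrays in place of the 199-element one.

```python
def strongest_coefficients(coefficients, k):
    """Return up to k (index, weight) pairs with the largest absolute weight."""
    ranked = sorted(range(len(coefficients)),
                    key=lambda i: abs(coefficients[i]),
                    reverse=True)
    return [(i, coefficients[i]) for i in ranked[:k]]


# Made-up frame data: 199 coefficients, nearly all of them zero.
coefficients = [0.0] * 199
coefficients[3] = 0.9
coefficients[57] = -0.6
coefficients[120] = 0.05
coefficients[198] = 0.3

selected = strongest_coefficients(coefficients, 3)
print(selected)

# Total weight that the shader would no longer apply (the approximation error budget).
dropped = sum(abs(c) for c in coefficients) - sum(abs(w) for _, w in selected)
print("dropped weight:", round(dropped, 6))
```

The shader loop then runs k times and uses the index to compute the texel position, instead of walking all 199 components. The result is an approximation unless the dropped weights really are zero.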

Nearest Neighbors in CUDA Particles

Edit 2: Please take a look at this crosspost for TLDR.
Edit: Given that the particles are segmented into grid cells (say a 16^3 grid), is it a better idea to run one work-group for each grid cell, with as many work-items per work-group as the maximum number of particles a grid cell can hold?
In that case I could load all particles from neighboring cells into local memory and iterate through them computing some properties. Then I could write specific value into each particle in the current grid cell.
Would this approach be beneficial over running the kernel for all particles and for each iterating over (most of the time the same) neighbors?
Also, what is the ideal ratio of number of particles/number of grid cells?
I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:
Buffer P holding all particles' 3D positions (float3)
Buffer Sp storing int2 pairs of particle ids and their spatial hashes. Sp is sorted according to the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)
Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp. Example: L[12] = (int2)(0, 50).
L[12].x is the index (in Sp) of the first particle with spatial hash 12.
L[12].y is the index (in Sp) of the last particle with spatial hash 12.
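For reference, the relationship between Sp and L can be sketched on the CPU as follows. This assumes each Sp entry is a (hash, particle id) pair sorted by hash, that the end index is inclusive as described above, and that empty cells are marked with (-1, -1); the empty-cell marker and the function name are my assumptions.

```python
def build_cell_ranges(sorted_entries, num_cells):
    """Build L from Sp.

    sorted_entries: list of (cell_hash, particle_id), sorted by cell_hash.
    Returns a list where entry h is (first, last), the inclusive index range
    in sorted_entries of the particles in cell h, or (-1, -1) if h is empty.
    """
    ranges = [(-1, -1)] * num_cells
    for index, (cell_hash, _particle_id) in enumerate(sorted_entries):
        first, _last = ranges[cell_hash]
        if first == -1:
            first = index          # first particle seen for this cell
        ranges[cell_hash] = (first, index)
    return ranges


# Made-up example: six particles spread over three of eight cells.
Sp = sorted([(2, 10), (0, 11), (2, 12), (5, 13), (0, 14), (2, 15)])
L = build_cell_ranges(Sp, 8)
print(L)

# Visiting the particles of cell 2; note the inclusive upper bound.
first, last = L[2]
print([Sp[p][1] for p in range(first, last + 1)])
```

With an inclusive end index, any loop over a cell has to run up to and including the stored end value, otherwise the last particle of each cell is skipped.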
Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):
__kernel void process_particles(float3* P, int2* Sp, int2* L, int* Out) {
    size_t gid = get_global_id(0);
    float3 curr_particle = P[gid];
    int processed_value = 0;
    for(int x=-1; x<=1; x++)
        for(int y=-1; y<=1; y++)
            for(int z=-1; z<=1; z++) {
                float3 neigh_position = curr_particle + (float3)(x,y,z)*GRID_CELL_SIDE;
                // ugly boundary checking
                if ( dot(neigh_position<0, (float3)(1)) +
                     dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
                    continue;
                int neigh_hash = spatial_hash( neigh_position );
                int2 particles_range = L[ neigh_hash ];
                // L[hash].y is the index of the last particle, so the bound is inclusive
                for(int p=particles_range.x; p<=particles_range.y; p++)
                    processed_value += heavy_computation( P[ Sp[p].y ] );
            }
    Out[gid] = processed_value;
}
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) to be causing the slowness.
What I want to do is to use Z-order curve as the spatial hash. That way I could have only 1 for loop iterating through a continuous range of memory when querying neighbors. The only problem is that I don't know what should be the start and stop Z-index values.
The holy grail I want to achieve:
__kernel void process_particles(float3* P, int2* Sp, int2* L, int* Out) {
    size_t gid = get_global_id(0);
    float3 curr_particle = P[gid];
    int processed_value = 0;
    // How to accomplish this??
    // `get_neighbors_range()` returns start and end Z-index values
    // representing the start and end near neighbors cells range
    int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
    int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
    int last_particle_id = L[ nearest_neighboring_cells_range.y ].y;
    for(int p=first_particle_id; p<=last_particle_id; p++) {
        processed_value += heavy_computation( P[ Sp[p].y ] );
    }
    Out[gid] = processed_value;
}
You should study the Morton code algorithms closely. Ericson's Real-Time Collision Detection explains them very well.
Ericson - Real time Collision detection
Here is another nice explanation including some tests:
Morton encoding/decoding through bit interleaving: Implementations
Z-order algorithms only define the path through the coordinates, which lets you hash 2D or 3D coordinates to a single integer. Although the algorithm goes one level deeper with every iteration, you have to set the limits yourself. Usually the stop index is denoted by a sentinel. The position of the sentinel tells you at which level the particle is placed. So the maximum level you define determines the number of cells per dimension. For example, with a maximum level of 6 you have 2^6 = 64, so you will have 64x64x64 cells in your system (3D). That also means that you have to use integer-based coordinates. If you use floats, you have to convert them, e.g. coord.x = 64*float_x (with float_x normalized to [0, 1)), and so on.
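As a concrete illustration of hashing integer cell coordinates, here is a sketch of 3D Morton encoding by bit interleaving, using the well-known magic-mask approach for up to 10 bits per axis. The function names are my own, and the conversion assumes positions normalized to [0, 1) with 64 cells per axis as in the example above.

```python
def part1by2(n: int) -> int:
    """Spread the low 10 bits of n so that two zero bits follow each bit."""
    n &= 0x000003FF
    n = (n ^ (n << 16)) & 0xFF0000FF
    n = (n ^ (n << 8)) & 0x0300F00F
    n = (n ^ (n << 4)) & 0x030C30C3
    n = (n ^ (n << 2)) & 0x09249249
    return n


def morton3d(x: int, y: int, z: int) -> int:
    """Interleave the bits of x, y and z (x in the lowest position)."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)


def cell_coord(position: float, cells_per_axis: int) -> int:
    """Convert a normalized position in [0, 1) to an integer cell index."""
    return min(int(position * cells_per_axis), cells_per_axis - 1)


CELLS = 64  # level 6: 2^6 cells per axis
cell = tuple(cell_coord(p, CELLS) for p in (0.02, 0.02, 0.02))
print(cell, morton3d(*cell))

# Morton codes of the 3x3x3 neighbourhood around cell (1, 1, 1).
codes = sorted(morton3d(x, y, z)
               for x in range(3) for y in range(3) for z in range(3))
print(len(codes), codes[0], codes[-1])
```

Note what the last line shows: the 27 neighbouring cells span the codes 0 to 56, so they do not form one contiguous range. A single start/stop pair would therefore also visit cells outside the neighbourhood, and in practice the neighbourhood is usually covered by several shorter ranges.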
If you know how many cells you have in your system you can define your limits. Are you trying to use a binary octree?
Since particles are in motion (in that CUDA example) you should try to parallelize over the number of particles instead of cells.
If you want to build lists of nearest neighbours you have to map the particles to cells. This is done through a table that is sorted afterwards by cells to particles. Still you should iterate through the particles and access its neighbours.
About your code:
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) to be causing the slowness.
Remember Donald Knuth: you should measure where the bottleneck is. You can use NVIDIA's profiling tools (such as nvprof or Nsight) to look for the bottleneck. I'm not sure what OpenCL has as a profiler.
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
continue;
I think you should not branch that way. How about returning zero from heavy_computation instead? I'm not sure, but the branch may cause divergence between threads here. Try to remove it somehow.
Running in parallel over the cells is a good idea only if you have no write accesses to the particle data; otherwise you will have to use atomics. If you go over the particle range instead, you only read from the cells and neighbours, you build your sum in parallel, and you avoid race conditions.
Also, what is the ideal ratio of number of particles/number of grid cells?
Really depends on your algorithms and the particle packing within your domain, but in your case I would define the cell size equivalent to the particle diameter and just use the number of cells you get.
So if you want to use Z-order and achieve your holy grail, try to use integer coordinates and hash them.
Also try to use a larger number of particles. You should consider about 65000 particles, as the CUDA example uses, because that way the parallelisation is most efficient: the processing units are fully used (fewer idle threads).
