DirectX 11 compute shader for ray/mesh intersect - directx-11

I recently converted a DirectX 9 application that was using D3DXIntersect to find ray/mesh intersections to DirectX 11. Since D3DXIntersect is not available in DX11, I wrote my own code to find the intersection, which just loops over all the triangles in the mesh and tests them, keeping track of the closest hit to the origin. This is done on the CPU side and works fine for picking via the GUI, but I have another part of the application that creates a new mesh from an existing one based on several different viewpoints, and I need to check line of sight for every triangle in the mesh many times. This gets pretty slow.
Does it make sense to use a DX11 compute shader to do this (i.e. would there be a significant speedup over doing it on the CPU)? I searched the internet but could not find an existing example.
Assuming the answer is yes, here is the approach I am thinking of:
Launch a thread for every triangle in my mesh
Each thread computes the distance to a hit on that triangle, or returns max float on a miss. Store one value per thread in a buffer.
Then do a reduction and return the minimum (non-negative) value.
I wish I had access to something like CUDA Thrust in DirectX, because I think coding up that reduction is going to be a pain. That's why I'm asking, so I don't do a bunch of work for nothing!

This is totally doable. Here is some HLSL code that performs it (and also handles the case where you hit two triangles at the same distance).
I assume that you know how to create resources (structured buffers) and bind them to the compute pipeline.
I'll also assume that your geometry is indexed.
The first step is to collect the triangles that pass the test. Instead of using a "hit" flag, we will use an append buffer and only push the elements that pass the test.
First create two structured buffers (positions and triangle indices), and copy your model data into them.
Then create a structured buffer with an append unordered access view.
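For reference, creating that append buffer on the C++ side might look roughly like the sketch below. The names (device, rayHitBuffer, rayHitUAV, triangleCount) and the RayHit struct mirroring the HLSL rayHit are placeholders, and error handling is omitted:
// Sketch: structured buffer of RayHit with an append UAV (sizes/names are placeholders).
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = sizeof(RayHit) * triangleCount;     // worst case: every triangle is hit
desc.Usage = D3D11_USAGE_DEFAULT;
desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
desc.StructureByteStride = sizeof(RayHit);           // uint + float = 8 bytes
device->CreateBuffer(&desc, nullptr, &rayHitBuffer);

D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = DXGI_FORMAT_UNKNOWN;                // required for structured buffers
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.NumElements = triangleCount;
uavDesc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_APPEND; // enables Append() and the hidden counter
device->CreateUnorderedAccessView(rayHitBuffer, &uavDesc, &rayHitUAV);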
To perform Hit detection, you can use the following Compute code:
struct rayHit
{
uint triangleID;
float distanceToTriangle;
};
cbuffer cbRaySettings : register(b0)
{
float3 rayFrom;
float3 rayDir;
uint TriangleCount;
};
StructuredBuffer<float3> positionBuffer : register(t0);
StructuredBuffer<uint3> indexBuffer : register(t1);
AppendStructuredBuffer<rayHit> appendRayHitBuffer : register(u0);
void TestTriangle(float3 p1, float3 p2, float3 p3, out bool hit, out float d)
{
//Standard Moller-Trumbore ray/triangle intersection (any routine that outputs a hit flag and a distance works here)
hit = false;
d = 0.0f;
float3 e1 = p2 - p1;
float3 e2 = p3 - p1;
float3 pv = cross(rayDir, e2);
float det = dot(e1, pv);
if (abs(det) < 1e-8f) //ray is parallel to the triangle plane
return;
float invDet = 1.0f / det;
float3 tv = rayFrom - p1;
float u = dot(tv, pv) * invDet;
if (u < 0.0f || u > 1.0f)
return;
float3 qv = cross(tv, e1);
float v = dot(rayDir, qv) * invDet;
if (v < 0.0f || u + v > 1.0f)
return;
d = dot(e2, qv) * invDet;
hit = (d > 0.0f);
}
[numthreads(64,1,1)]
void CS_RayAppend(uint3 tid : SV_DispatchThreadID)
{
if (tid.x >= TriangleCount)
return;
uint3 indices = indexBuffer[tid.x];
float3 p1 = positionBuffer[indices.x];
float3 p2 = positionBuffer[indices.y];
float3 p3 = positionBuffer[indices.z];
bool hit;
float d;
TestTriangle(p1,p2,p3,hit, d);
if (hit)
{
rayHit hitData;
hitData.triangleID = tid.x;
hitData.distanceToTriangle = d;
appendRayHitBuffer.Append(hitData);
}
}
Please note that you need to provide a sufficient size for appendRayHitBuffer (the worst case is TriangleCount elements, i.e. every triangle is hit by the ray).
Once this is done, the beginning of the buffer contains the hit data, and the unordered access view's hidden counter holds the number of triangles that passed the test.
Then you need to create an argument buffer, plus a small byte address buffer (size 16, since I don't think the runtime will allow 12).
You also need a small structured buffer (one element is enough), which will be used to store the minimum distance.
Use CopyStructureCount to copy the unordered access view counter into those buffers (please note that the second and third elements of the argument buffer both need to be set to 1, as they will be the Y and Z arguments of the dispatch).
Clear the small structured buffer to UINT_MAX, and use the argument buffer with DispatchIndirect.
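On the C++ side, that sequence could look something like this sketch (buffer and view names are placeholders; it assumes the argument buffer was created with the D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS flag and its second and third uints pre-filled with 1):
// Copy the append buffer's hidden counter into the first uint of the argument buffer
// (the other two uints stay at 1, so the indirect dispatch is count x 1 x 1 groups).
context->CopyStructureCount(argumentBuffer, 0, rayHitUAV);
// Also copy it into the small byte address buffer read by the reduction shader.
context->CopyStructureCount(rayHitCountBuffer, 0, rayHitUAV);

// Clear the one-element minimum-distance buffer to UINT_MAX so InterlockedMin works.
const UINT clearValue[4] = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF };
context->ClearUnorderedAccessViewUint(minDistanceUAV, clearValue);

// Bind CS_CalcMin and its SRVs/UAVs, then dispatch one group per recorded hit.
context->DispatchIndirect(argumentBuffer, 0);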
I assume that you will not have many hits, so for the next part numthreads is set to (1,1,1). (If you want to use larger groups, you will need to run another compute shader to build the argument buffer.)
Then to find minimum distance:
StructuredBuffer<rayHit> rayHitbuffer : register(t0);
ByteAddressBuffer rayHitCount : register(t1);
RWStructuredBuffer<uint> rwMinBuffer : register(u0);
[numthreads(1,1,1)]
void CS_CalcMin(uint3 tid : SV_DispatchThreadID)
{
uint count = rayHitCount.Load(0);
if (tid.x >= count)
return;
rayHit hit = rayHitbuffer[tid.x];
uint dummy;
InterlockedMin(rwMinBuffer[0],asuint(hit.distanceToTriangle), dummy);
}
Since the hit distance is expected to be greater than zero, reinterpreting its bits with asuint preserves the ordering, so InterlockedMin can be used directly in this scenario. Also, since we use DispatchIndirect, this pass only runs on the elements that previously passed the test.
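As an aside, a quick way to convince yourself of the asuint trick (non-negative IEEE-754 floats keep their ordering when their bit patterns are compared as unsigned integers) is a tiny host-side check; this C++ snippet is purely illustrative and not part of the original answer:
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit unsigned integer (like HLSL asuint).
static uint32_t asuint32(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main() {
    // For non-negative floats, a smaller float always yields a smaller bit pattern,
    // which is why InterlockedMin on asuint(distance) finds the minimum distance.
    assert(asuint32(0.5f) < asuint32(1.5f));
    assert(asuint32(0.0f) < asuint32(1e-6f));
    assert(asuint32(100.0f) < asuint32(3.4e38f));
    return 0;
}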
Now your single-element buffer contains the minimum distance, but not the index (or indices).
In the last part, we extract the triangle index (or indices) at the minimum hit distance.
Once again you need a new structured buffer with an append unordered access view, this time to store the minimum-distance index.
Use the same dispatch arguments as before (indirect), and perform the following:
StructuredBuffer<rayHit> rayHitbuffer : register(t0);
ByteAddressBuffer rayHitCount : register(t1);
StructuredBuffer<uint> MinDistanceBuffer : register(t2);
AppendStructuredBuffer<uint> appendMinHitIndexBuffer : register(u0);
[numthreads(1,1,1)]
void CS_AppendMin(uint3 tid : SV_DispatchThreadID)
{
uint count = rayHitCount.Load(0);
if (tid.x >= count)
return;
rayHit hit = rayHitbuffer[tid.x];
uint minDist = MinDistanceBuffer[0];
uint d = asuint(hit.distanceToTriangle);
if (d == minDist)
{
appendMinHitIndexBuffer.Append(hit.triangleID);
}
}
Now appendMinHitIndexBuffer contains the index of the closest triangle (or several indices, if you hit the equal-distance scenario). You can copy it back using a staging resource and Map that resource for reading.
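The readback might look roughly like this sketch (names are placeholders; the staging buffer is assumed to have been created with D3D11_USAGE_STAGING, D3D11_CPU_ACCESS_READ and the same size as the result buffer):
// Copy the GPU result into a CPU-readable staging buffer, then map it.
context->CopyResource(stagingBuffer, minHitIndexBuffer);
// If you also need to know how many indices were appended,
// CopyStructureCount into a small staging buffer works the same way.
D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped)))
{
    uint32_t closestTriangle = *static_cast<const uint32_t*>(mapped.pData);
    context->Unmap(stagingBuffer, 0);
    // ... use closestTriangle ...
}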

Actually, it makes a lot of sense. Here is also a whitepaper with some useful shader snippets: http://www.graphicon.ru/html/2012/conference/EN2%20-%20Graphics/gc2012Shumskiy.pdf . You can use DirectCompute, CUDA, or OpenCL alongside DirectX, but if I may give you a hint, do it in DirectCompute, because I think it is the least hassle to set up and get running.

Related

OpenCL crash when calling finish()

I am writing an OpenCL app on macOS using C++, and it crashes in certain cases depending on the work size.
The program crashes due to a SIGABRT.
Is there any way to get more information about the error?
Why is SIGABRT being raised? Can I catch it?
EDIT:
I realize that this program is a doozy; however, I will try to explain it in case anyone would like to take a stab at it.
Through debugging I discovered that the cause of the SIGABRT was one of the kernels timing out.
The program is a tile-based 3D renderer. It is an OpenCL implementation of this algorithm: https://github.com/ssloy/tinyrenderer
The screen is divided into 8x8 tiles. One of the kernels (the tiler) computes which polygons overlap each tile, storing the results in a data structure called tilePolys. A subsequent kernel (the rasterizer), which runs one work item per tile, iterates over the list of polys occupying the tile and rasterizes them.
The tiler writes to an integer buffer which is a list of lists of polygon indices. Each list is of a fixed size (polysPerTile + 1 for the count) where the first element is the count and the subsequent polysPerTile elements are indices of polygons in the tile. There is one such list per tile.
For some reason in certain cases the tiler writes a very large poly count (13172746) to one of the tile's lists in tilePolys. This causes the rasterizer to loop for a long time and time out.
The strange thing is that the index to which the large count is written is never accessed by the tiler.
The code for the tiler kernel is below:
// this kernel is executed once per polygon
// it computes which tiles are occupied by the polygon and adds the index of the polygon to the list for that tile
kernel void tiler(
// number of polygons
ulong nTris,
// width of screen
int width,
// height of screen
int height,
// number of tiles in x direction
int tilesX,
// number of tiles in y direction
int tilesY,
// number of pixels per tile (tiles are square)
int tileSize,
// size of the polygon list for each tile
int polysPerTile,
// 4x4 matrix representing the viewport
global const float4* viewport,
// vertex positions
global const float* vertices,
// indices of vertices
global const int* indices,
// array of array-lists of polygons per tile
// structure of list is an int representing the number of polygons covering that tile,
// followed by [polysPerTile] integers representing the indices of the polygons in that tile
// there are [tilesX*tilesY] such arraylists
volatile global int* tilePolys)
{
size_t faceInd = get_global_id(0);
// compute vertex position in viewport space
float3 vs[3];
for(int i = 0; i < 3; i++) {
// indices are vertex/uv/normal
int vertInd = indices[faceInd*9+i*3];
float4 vertHomo = (float4)(vertices[vertInd*4], vertices[vertInd*4+1], vertices[vertInd*4+2], vertices[vertInd*4+3]);
vertHomo = vec4_mul_mat4(vertHomo, viewport);
vs[i] = vertHomo.xyz / vertHomo.w;
}
float2 bboxmin = (float2)(INFINITY,INFINITY);
float2 bboxmax = (float2)(-INFINITY,-INFINITY);
// size of screen
float2 clampCoords = (float2)(width-1, height-1);
// compute bounding box of triangle in screen space
for (int i=0; i<3; i++) {
for (int j=0; j<2; j++) {
bboxmin[j] = max(0.f, min(bboxmin[j], vs[i][j]));
bboxmax[j] = min(clampCoords[j], max(bboxmax[j], vs[i][j]));
}
}
// transform bounding box to tile space
int2 tilebboxmin = (int2)(bboxmin[0] / tileSize, bboxmin[1] / tileSize);
int2 tilebboxmax = (int2)(bboxmax[0] / tileSize, bboxmax[1] / tileSize);
// loop over all tiles in bounding box
for(int x = tilebboxmin[0]; x <= tilebboxmax[0]; x++) {
for(int y = tilebboxmin[1]; y <= tilebboxmax[1]; y++) {
// get index of tile
int tileInd = y * tilesX + x;
// get start index of polygon list for this tile
int counterInd = tileInd * (polysPerTile + 1);
// get current number of polygons in list
int numPolys = atomic_inc(&tilePolys[counterInd]);
// if list is full, skip tile
if(numPolys >= polysPerTile) {
// decrement the count because we will not add to the list
atomic_dec(&tilePolys[counterInd]);
} else {
// otherwise add the poly to the list
// the index is the offset + numPolys + 1 as tilePolys[counterInd] holds the poly count
int ind = counterInd + numPolys + 1;
tilePolys[ind] = (int)(faceInd);
}
}
}
}
My theories are that either:
I have incorrectly implemented the atomic functions for reading and incrementing the count
I am using an incorrect number format causing garbage to be written into tilePolys
One of my other kernels is inadvertently writing into the tilePolys buffer
I do not think it is the last one though because if instead of writing faceInd to tilePolys, I write a constant value, the large poly count disappears.
tilePolys[counterInd+numPolys+1] = (int)(faceInd); // this is the problem line
tilePolys[counterInd+numPolys+1] = (int)(5); // this fixes the issue
It looks like your kernel is crashing on the GPU itself. You can't really get any extra diagnostics about that directly, at least not on macOS. You'll need to start narrowing down the problem. Some suggestions:
As the crash is currently happening in clFinish() you don't know what asynchronous command is causing the crash. Try switching all your enqueue calls to blocking mode. This should cause it to crash in the call that's actually going wrong.
Check return/error codes on all OpenCL API calls. Sometimes, ignoring an error from an earlier call can cause problems in a later call which relies on earlier results. For example, if creating a buffer fails, passing the result of that buffer creation as a kernel argument will cause problems when trying to run the kernel. (A short sketch illustrating this and the previous suggestion follows after this list.)
The most likely reason for the crash is that your OpenCL kernel is accessing memory out of bounds or is otherwise misusing pointers. Re-check any array index calculations.
Check if the problem occurs with smaller work batches. Scale up from one workgroup (or work item if not using groups) and see if it only occurs beyond a certain work size. This may give you a clue about buffer sizes and array indices that might be causing the crash.
Systematically comment out parts of your kernel. If the crash goes away if you comment out a specific piece of code, there's a good chance the problem is in that code.
If you've narrowed the problem down to a small area of code but can't work out where it's coming from, start recording diagnostic output to check that variables have the values you're expecting.
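As a concrete illustration of the first two suggestions, host code along these lines (C++ against the OpenCL C API; the kernel, queue, and buffer names are placeholders for whatever your launch function uses) makes the failing call easy to localize:
// Check every OpenCL call and synchronize after each launch while debugging.
cl_int err = clSetKernelArg(tilerKernel, 0, sizeof(cl_ulong), &nTris);
if (err != CL_SUCCESS) { fprintf(stderr, "clSetKernelArg failed: %d\n", err); return; }

err = clEnqueueNDRangeKernel(queue, tilerKernel, 1, nullptr,
                             &globalSize, nullptr, 0, nullptr, nullptr);
if (err != CL_SUCCESS) { fprintf(stderr, "enqueue failed: %d\n", err); return; }

// Temporarily force completion right after the launch so a GPU-side fault
// surfaces here rather than in a later clFinish().
err = clFinish(queue);
if (err != CL_SUCCESS) { fprintf(stderr, "tiler kernel failed: %d\n", err); return; }

// A blocking read (CL_TRUE) also synchronizes, so errors show up at this call.
err = clEnqueueReadBuffer(queue, tilePolysBuf, CL_TRUE, 0, tilePolysBytes,
                          hostTilePolys.data(), 0, nullptr, nullptr);
if (err != CL_SUCCESS) { fprintf(stderr, "read failed: %d\n", err); return; }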
Without seeing any code, I can't give you any more specific advice than that.
Note that OpenCL is deprecated on macOS, so if you're specifically targeting that platform and don't need to support Linux, Windows, etc. I recommend learning Metal Compute instead. Apple has made it clear that this is the GPU programming platform they want to support, and the tooling for it is already much better than their OpenCL tooling ever was.
I suspect Apple will eventually stop implementing OpenCL support when they release a Mac with a new type of GPU, so even if you're targeting the Mac as well as other platforms, you will probably need to switch to Metal on the Mac somewhere down the line anyway. As of macOS 10.14, the minimum system requirements of the OS already include a Metal-capable GPU, so you only need OpenCL as a fallback if you wish to support all Mac models able to run 10.13 or an even older OS version.

Few objects moving straight one by one in Unity

I'm trying to create some snake-like movement, but I can't work out an algorithm that moves each body part straight behind the previous one, and so on.
I want an automatically moving snake that consists of separate blocks (spheres). The snake should move along a path. I generate the path with a Bezier spline and have already implemented moving the first part along it. The point for the head is obtained from the spline via the following API:
class BezierSpline
{
Vector3 GetPoint(float progress) // 0 to 1
}
Then I have a SnakeMovement script:
public class SnakeMovement : MonoBehaviour
{
public BezierSpline Path;
public List<Transform> Parts;
public float minDistance = 0.25f;
public float speed = 1;
//.....
void Update()
{
Vector3 position = Path.GetPoint(progress);
Parts.First().localPosition = position;
Parts.First().LookAt(position + Path.GetDirection(progress));
for (int i = 1; i < Parts.Count; i++)
{
Transform curBody = Parts[i];
Transform prevBody = Parts[i - 1];
float dist = Vector3.Distance(prevBody.position, curBody.position);
Vector3 newP = prevBody.position;
newP.y = Parts[0].position.y;
float t = Time.deltaTime * dist / minDistance * curspeed;
curBody.position = Vector3.Slerp(curBody.position, newP, t);
curBody.rotation = Quaternion.Slerp(curBody.rotation, prevBody.rotation, t);
}
//....
}
Right now, if I stop the head's movement, the other parts don't preserve their distance and keep moving towards the head position. Another problem with the above algorithm is that the parts don't exactly follow the head's path; they can "cut" corners while turning.
The main idea is that the user/AI controls only the head (the first body part), and each following part should exactly repeat the head's path while preserving the distance to its neighbours.
For snake-like motion you are likely to get lots of strange behaviours if you treat the spheres as separate objects. While I can imagine it's possible to get it to work, I think this is not the best approach.
The first solution that comes to mind is to create a List to which, on every frame, you insert the current position of the snake's head at index 0.
The list grows, and all the other segments wait their turn, i.e. lag some number of frames x behind, so on each update segment y takes the position stored at list[x*y].
If the list's Count() is greater than number_of_segments*lag, you RemoveAt(Count()-1).
This can be optimized, as changing the list is somewhat costly (a ring buffer would be better suited, but a Queue could also work; for starters I find Lists much easier to follow, and you can always optimize later). It may behave a bit awkwardly if your framerate varies a lot, but it should be very stable in general (as in: no unpredictable motion, since we only re-use the same values over and over).
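A minimal sketch of that history-buffer idea, written here in C++ with placeholder names purely to show the data flow (the original project is Unity/C#, so treat it as pseudocode):
#include <deque>
#include <vector>

struct Vec3 { float x, y, z; };

struct SnakeHistory {
    std::deque<Vec3> history; // history[0] is the newest head position
    int lag = 10;             // frames between consecutive segments

    // Call once per frame with the new head position.
    void update(const Vec3& headPos, std::vector<Vec3>& segments) {
        history.push_front(headPos);
        for (std::size_t y = 0; y < segments.size(); ++y) {
            std::size_t idx = y * static_cast<std::size_t>(lag); // segment y lags y*lag frames
            if (idx < history.size())
                segments[y] = history[idx];
        }
        // Keep just enough history for the last segment.
        std::size_t needed = segments.size() * static_cast<std::size_t>(lag) + 1;
        while (history.size() > needed)
            history.pop_back();
    }
};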
Second method:
You mentioned using a Bezier spline to generate the path. Beziers are parametrized by a float t, so you have something like
SplineAt(t).
if you take your bezier_path_length and distance_between_segments, then segment n should have the position
SplineAt(t-n*distance_between_segments/bezier_path_length)
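As a rough illustration of that indexing, again in C++ with a hypothetical splineAt placeholder standing in for BezierSpline.GetPoint:
struct Vec3 { float x, y, z; };

// Placeholder spline evaluator; in the real project this would wrap BezierSpline.GetPoint(t).
Vec3 splineAt(float t) { return Vec3{t, 0.0f, 0.0f}; }

#include <vector>
void placeSegments(std::vector<Vec3>& parts, float headT,
                   float segmentSpacing, float pathLength)
{
    for (std::size_t n = 0; n < parts.size(); ++n) {
        // Each segment trails the head by n * spacing, measured along the spline.
        float t = headT - static_cast<float>(n) * segmentSpacing / pathLength;
        parts[n] = splineAt(t < 0.0f ? 0.0f : t);
    }
}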

How to improve texture access performance in OpenGL shaders?

Conditions
I use OpenGL 3 and PyOpenGL.
I have ~50 thousand (53,490) vertices, and each of them has 199 vec3 attributes which determine its displacement. It's impossible to store this data as regular vertex attributes, so I use a texture.
The problem is that a non-parallelized C function calculates the vertex displacements as fast as the GLSL version, and even faster in some cases. I've checked: the issue is the texture reads, and I don't understand how to optimize them.
I've written two different shaders. One calculates the new model in ~0.09 s and the other in ~0.12 s (including attribute assignment, which is the same in both cases).
Code
Both shaders start with
#version 300 es
in vec3 vin_position;
out vec4 vin_pos;
uniform mat4 rotation_matrix;
uniform float coefficients[199];
uniform sampler2D principal_components;
The faster one is
void main(void) {
int c_pos = gl_VertexID;
int texture_size = 8192;
ivec2 texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
vec4 tmp = vec4(0.0);
for (int i = 0; i < 199; i++) {
tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
c_pos += 53490;
texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
}
gl_Position = rotation_matrix
* vec4(vin_position + tmp.xyz, 246006.0);
vin_pos = gl_Position;
}
The slower one
void main(void) {
int texture_size = 8192;
int columns = texture_size - texture_size % 199;
int c_pos = gl_VertexID * 199;
ivec2 texPos = ivec2(c_pos % columns, c_pos / columns);
vec4 tmp = vec4(0.0);
for (int i = 0; i < 199; i++) {
tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
texPos.x++;
}
gl_Position = rotation_matrix
* vec4(vin_position + tmp.xyz, 246006.0);
vin_pos = gl_Position;
}
The main difference between them:
in the first case, vertex attributes are stored in the following way:
first attributes of all vertices
second attributes of all vertices
...
last attributes of all vertices
in the second case, vertex attributes are stored the other way around:
all attributes of the first vertex
all attributes of the second vertex
...
all attributes of the last vertex
Also, in the second example the data is aligned so that all the attributes of each vertex are stored in a single row. This means that if I know the row and column of a vertex's first attribute, I only need to increment the x component of the texture coordinate.
I thought that aligned data would be accessed faster.
Questions
Why isn't the data accessed faster?
How can I increase its performance?
Is there a way to link a texture chunk to a vertex?
Are there recommendations for data alignment, or good articles about caching in GPUs (Intel HD, NVIDIA GeForce)?
Notes
the coefficients array changes from frame to frame; otherwise there would be no problem: I could precalculate the model and be happy
Why isn't the data accessed faster?
Because GPUs are not magical. GPUs gain performance by performing calculations in parallel. Performing 1 million texel fetches, no matter how it happens, is not going to be fast.
If you were using the results of those textures to do lighting computations, it would appear fast because the cost of the lighting computation would be hidden by the latency of the memory fetches. You are taking the results of a fetch, doing a multiply/add, then doing another fetch. That's slow.
Is there a way to link a texture chunk to a vertex?
Even if there were (and there isn't), how would that help? GPUs execute operations in parallel. That means multiple vertices are being processed simultaneously, each performing ~200 texture fetches.
So what would aid performance there is making each texture access coherent. That is, neighboring vertices would access neighboring texels, thus making the texture fetches more cache efficient. But there's no way to know what vertices will be considered "neighbors". And texture swizzle layouts are implementation dependent, so even if you did know the order of vertex processing, you couldn't adjust your texture to take local advantage of it.
The best way to do that would be to ditch vertex shaders and texture accesses in favor of compute shaders and SSBOs. That way, you have direct knowledge of the locality of your accesses, by setting the work group size. With SSBOs, you can arrange your array in whatever fashion gives you the best locality of access for each wavefront.
But things like this are the equivalent of putting band-aids on a gaping wound.
How can I increase its performance?
Stop doing so many texture fetches.
I'm being completely serious. While there are ways to mitigate the costs of what you're doing, the most effective solution is to change your algorithm so that it doesn't need to do that much work.
Your algorithm looks suspiciously like vertex morphing via a palette of "poses", with the coefficient specifying the weight applied to each pose. If that's the case, then odds are good that most of your coefficients are either 0 or negligibly small. If so, then you're wasting vast amounts of time accessing textures only to transform their contributions into nothing.
If most of your coefficients are 0, then the best thing to do would be to pick some arbitrary and small number for the maximum number of coefficients that can affect the result. For example, 8. You send an array of 8 indices and coefficients to the shader as uniforms. Then you walk that array, fetching only 8 times. And you might be able to get away with just 4.
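A rough host-side sketch of that pruning idea (C++; the count of 8, the names, and the selection criterion are all assumptions, and the same could of course be done in Python before uploading the uniforms):
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Pick the N most influential poses so the shader only has to fetch N rows.
struct ActiveCoefficient { int index; float weight; };

std::vector<ActiveCoefficient> topCoefficients(const std::vector<float>& coeffs,
                                               std::size_t maxActive = 8)
{
    std::vector<ActiveCoefficient> all;
    all.reserve(coeffs.size());
    for (std::size_t i = 0; i < coeffs.size(); ++i)
        all.push_back({static_cast<int>(i), coeffs[i]});

    // Keep the maxActive entries with the largest absolute weight.
    std::size_t n = std::min(maxActive, all.size());
    std::partial_sort(all.begin(), all.begin() + n, all.end(),
                      [](const ActiveCoefficient& a, const ActiveCoefficient& b) {
                          return std::fabs(a.weight) > std::fabs(b.weight);
                      });
    all.resize(n);
    return all; // upload indices and weights as two small uniform arrays
}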

Nearest Neighbors in CUDA Particles

Edit 2: Please take a look at this crosspost for TLDR.
Edit: Given that the particles are segmented into grid cells (say a 16^3 grid), is it a better idea to run one work-group per grid cell, with as many work-items per work-group as the maximum possible number of particles per grid cell?
In that case I could load all particles from neighboring cells into local memory and iterate through them computing some properties. Then I could write specific value into each particle in the current grid cell.
Would this approach be better than running the kernel for all particles and, for each one, iterating over (mostly the same) neighbors?
Also, what is the ideal ratio of number of particles/number of grid cells?
I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:
Buffer P holding all particles' 3D positions (float3)
Buffer Sp storing int2 pairs of particle ids and their spatial hashes. Sp is sorted according to the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)
Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp. Example: L[12] = (int2)(0, 50).
L[12].x is the index (in Sp) of the first particle with spatial hash 12.
L[12].y is the index (in Sp) of the last particle with spatial hash 12.
Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):
__kernel process_particles(float3* P, int2* Sp, int2* L, int* Out) {
size_t gid = get_global_id(0);
float3 curr_particle = P[gid];
int processed_value = 0;
for(int x=-1; x<=1; x++)
for(int y=-1; y<=1; y++)
for(int z=-1; z<=1; z++) {
float3 neigh_position = curr_particle + (float3)(x,y,z)*GRID_CELL_SIDE;
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
continue;
int neigh_hash = spatial_hash( neigh_position );
int2 particles_range = L[ neigh_hash ];
for(int p=particles_range.x; p<particles_range.y; p++)
processed_value += heavy_computation( P[ Sp[p].y ] );
}
Out[gid] = processed_value;
}
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) to be causing the slowness.
What I want to do is to use Z-order curve as the spatial hash. That way I could have only 1 for loop iterating through a continuous range of memory when querying neighbors. The only problem is that I don't know what should be the start and stop Z-index values.
The holy grail I want to achieve:
__kernel process_particles(float3* P, int2* Sp, int2* L, int* Out) {
size_t gid = get_global_id(0);
float3 curr_particle = P[gid];
int processed_value = 0;
// How to accomplish this??
// `get_neighbors_range()` returns start and end Z-index values
// representing the start and end near neighbors cells range
int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
int last_particle_id = L[ nearest_neighboring_cells_range.y ].y;
for(int p=first_particle_id; p<=last_particle_id; p++) {
processed_value += heavy_computation( P[ Sp[p].y ] );
}
Out[gid] = processed_value;
}
You should study the Morton code algorithms closely. Ericson's Real-Time Collision Detection explains this very well.
Ericson - Real time Collision detection
Here is another nice explanation including some tests:
Morton encoding/decoding through bit interleaving: Implementations
The Z-order algorithm only defines the path through the coordinates by which you can hash 2D or 3D coordinates into a single integer. Although the algorithm goes deeper with every iteration, you have to set the limits yourself. Usually the stop index is denoted by a sentinel; where the sentinel stops tells you at which level the particle is placed. So the maximum level you define determines the number of cells per dimension. For example, with a maximum level of 6 you have 2^6 = 64, i.e. 64x64x64 cells in your system (3D). That also means you have to use integer-based coordinates; if you use floats you have to convert, e.g. coord.x = 64*float_x, and so on.
If you know how many cells you have in your system you can define your limits. Are you trying to use a binary octree?
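For illustration, a standard 3D Morton encoder via bit interleaving (along the lines of the linked article) might look like this in C++; the 10-bits-per-axis limit is an assumption, not something prescribed by the answer:
#include <cstdint>

// Spread the lower 10 bits of x so there are two zero bits between each bit.
static uint32_t part1By2(uint32_t x)
{
    x &= 0x000003FF;
    x = (x | (x << 16)) & 0x030000FF;
    x = (x | (x << 8))  & 0x0300F00F;
    x = (x | (x << 4))  & 0x030C30C3;
    x = (x | (x << 2))  & 0x09249249;
    return x;
}

// Interleave integer cell coordinates (each < 1024) into a single Z-order index.
static uint32_t mortonEncode3(uint32_t cx, uint32_t cy, uint32_t cz)
{
    return (part1By2(cz) << 2) | (part1By2(cy) << 1) | part1By2(cx);
}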
Since particles are in motion (in that CUDA example) you should try to parallelize over the number of particles instead of cells.
If you want to build lists of nearest neighbours you have to map the particles to cells. This is done through a table that is afterwards sorted by cells to particles. Still, you should iterate through the particles and access their neighbours.
About your code:
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particulary P[Sp[p].y] in the inner-most for loop) to be causing the slowness.
Remember Donald Knuth: measure where the bottleneck is. You can use the NVIDIA profiler to look for bottlenecks; I'm not sure what OpenCL has as a profiler.
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
continue;
I don't think you should branch it that way; how about returning zero from heavy_computation instead? I'm not sure, but you may be paying a branching penalty here. Try to remove that somehow.
Parallelizing over the cells is a good idea only if you have no write accesses to the particle data; otherwise you will have to use atomics. If you go over the particle range instead, you have read accesses to the cells and neighbours, but you build your sum in parallel and are not forced into handling race conditions.
Also, what is the ideal ratio of number of particles/number of grid cells?
It really depends on your algorithm and how the particles pack within your domain, but in your case I would set the cell size equal to the particle diameter and just use the number of cells that gives you.
So if you want to use Z-order and achieve your holy grail, try to use integer coordinates and hash them.
Also try to use larger numbers of particles. Something like the 65,000 particles the CUDA example uses is worth considering, because at that scale the parallelization is most efficient: the available processing units are kept busy (fewer idle threads).

Dispatching more than 65535 threads

I'm attempting to skin vertices using DirectCompute. The method of skinning employed allows a variable number of weights to influence each vertex (e.g. MD5 meshes are defined this way).
Basically, the inputs to the compute shader are:
JointsBuffer { float4 orientation, float4 position } Structured buffer SRV
WeightsBuffer { float3 normal, float4 position, float bias, uint jointIndex } Structured buffer SRV
VerticesBuffer { float2 texcoords, uint weightIndex, uint numWeights } Structured buffer SRV
and the output is
SkinnedVerticesBuffer { float3 normal, float4 position, float2 texcoord } Structured buffer UAV
Now the compute shader should be run once per element in the vertex buffer, and using SV_DispatchThreadID the shader attempts to populate the corresponding SkinnedVertex in the SkinnedVerticesBuffer for every Vertex in the VerticesBuffer ( 1:1 correspondence ).
So the problem is that many meshes have greater than 65535 vertices, and the DispatchThreadID command only allows for dispatching that many threads per dimension. Now I can theoretically write something that divides a lot of numbers up into a combination of three factors less than 65535, but I can't possibly do that for prime numbers.
So, for example, when a mesh with 71993 (a prime number) vertices comes up, I can't think of a way to handle it.
I can't over dispatch say 72000 threads with context->Dispatch( 36000, 2, 0 ), because then DispatchThreadID will run out of my buffer bounds.
Right now I'm leaning towards a constant buffer holding the number of vertices, and then over-dispatching to the nearest power of 2 and simply doing
if( SV_DispatchThreadID >= numVertices ) return;
Is this my only option? Has anyone else run into this snag?
I never have. But 65000 threads seems like an awful lot.
Then, when I try to find documentation, it seems that the values you pass are not threads but thread groups. Someone on gamedev seems to have had performance issues when passing a number as large as 768, so it seems to me that you will have to decrease that huge number.
I'm not sure, but I got the feeling you're misinterpreting these parameters. Try to read again what these values actually mean. (Just a layman's gut feeling, though.)
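For what it's worth, the over-dispatch approach from the question is a common way to handle this: pick a group size in the shader (say [numthreads(64,1,1)]), pass the vertex count in a constant buffer, bounds-check in the shader, and round the group count up on the CPU. A hedged C++ sketch with placeholder names:
// numVertices comes from the mesh; 64 must match [numthreads(64,1,1)] in the shader.
const UINT threadsPerGroup = 64;
const UINT groupCount = (numVertices + threadsPerGroup - 1) / threadsPerGroup; // ceiling division
context->Dispatch(groupCount, 1, 1);
// In the shader: if (dispatchThreadId.x >= numVertices) return;
// With 64 threads per group, the 65535-groups-per-dimension limit covers ~4 million vertices.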
