How to pass a large array from three.js to a vertex shader?

Wondering if it's possible to pass a large array into a WebGL shader, like this:
// array here
uniform vec3[hugeSize] arrayOfStars;
void main() {
  // iterate through the array here to compare positions with the current particle
  gl_Position = ...;
}
I would like to iterate over a vec3 array containing the positions of 1 million particles.
I would like for each particle to compare its position with those of the others (or a percentage of the others for better performance) in order to calculate its next position.
(I'm trying to roughly simulate a galaxy)
My problem is that I can't declare an array with a size greater than 4075, while I would need a vec3 array with a size of 1,000,000.
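For reference, uniform arrays are capped by GL_MAX_VERTEX_UNIFORM_VECTORS (usually only a few thousand vec4s), which is why the declaration above stops around 4075 elements. One common workaround, not shown in the question, is to pack the positions into a floating-point DataTexture and fetch them by index in the vertex shader. A minimal three.js sketch, assuming a WebGL2 renderer; the names starPositions and texSize and the one-texel-per-particle packing are illustrative:
import * as THREE from "three";

// Pack one particle position per RGBA texel of a float texture.
const count = 1000000;
const size = Math.ceil(Math.sqrt(count));        // e.g. 1000 x 1000 texels
const data = new Float32Array(size * size * 4);  // x, y, z per particle (w unused)
// ... fill data[4 * i + 0..2] with the star positions ...
const starPositions = new THREE.DataTexture(data, size, size, THREE.RGBAFormat, THREE.FloatType);
starPositions.needsUpdate = true;

const material = new THREE.ShaderMaterial({
  glslVersion: THREE.GLSL3, // WebGL2 GLSL, needed for texelFetch / gl_VertexID
  uniforms: {
    starPositions: { value: starPositions },
    texSize: { value: size },
  },
  vertexShader: `
    uniform sampler2D starPositions;
    uniform int texSize;
    void main() {
      // Fetch any particle's position by index instead of using a huge uniform array.
      int i = gl_VertexID;
      vec3 other = texelFetch(starPositions, ivec2(i % texSize, i / texSize), 0).xyz;
      // ... compare 'other' with this particle's position here ...
      gl_Position = projectionMatrix * modelViewMatrix * vec4(position, 1.0);
    }
  `,
  fragmentShader: `
    out vec4 outColor;
    void main() { outColor = vec4(1.0); }
  `,
});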

Related

How to get regular float32 geometry of a mesh compressed using gltfpack

I used mesh decimation using gltfpack with the command
gltfpack -i input.glb -o output.glb -si 0.01
This reduces my mesh geometry triangles by 99 percent.
Now my output.glb has geometry.position as an interleaved BufferAttribute of data type unsigned int16. I am using ammo.js to make its physics body, which requires the geometry as a regular float32 array. But my attempts to convert it by dividing by 2^32 - 1 or 2^31 - 1 have failed.
I get the body, but positioned with a different size and position, and it doesn't align with the three.js rendered model.
Is there a way to convert it to a regular float32 array so that I can pass it to the createConvexHullPhysicsShape function of Ammo.js?
I've added a sandbox that shows the issue.
Steps for quantizing and de-quantizing a normalized array can be found in the KHR_mesh_quantization specification, equivalent to what hardware APIs will do. For int16 that would be:
float = max(int / 32767.0, -1.0)
This does assume that the values are normalized. In gltfpack I think that is an option. You can check attribute.normalized in three.js. If it is false there is no need for conversion, you can just use the values as-is.
Finally, note that if the buffer is interleaved, you will have to de-interleave, as the buffer might contain vertex attributes other than position.
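Applied to a whole attribute array, the de-quantization above might look like this (a plain JavaScript sketch; the function name is illustrative, and it is only needed when attribute.normalized is true):
// De-quantize a KHR_mesh_quantization-style normalized int16 array to float32:
// float = max(int / 32767.0, -1.0)
function dequantizeInt16(int16Array) {
  const out = new Float32Array(int16Array.length);
  for (let i = 0; i < int16Array.length; i++) {
    out[i] = Math.max(int16Array[i] / 32767, -1);
  }
  return out;
}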
UPDATE: September 9, 2022
The positions in this GLB do not actually appear to be normalized, in the WebGL sense. They are large integers, and there is an inverse scaling on the parent nodes that brings them back down into a smaller floating-point range. You'll need to account for that when extracting positions from the Mesh to create a convex hull. For example:
import * as THREE from "three";
import { deinterleaveAttribute } from "three/examples/jsm/utils/BufferGeometryUtils.js";

function getVertexPositions(root) {
  const positions = [];
  root.updateMatrixWorld(true);
  root.traverse((obj) => {
    if (obj.type !== "Mesh") return;

    let position = obj.geometry.attributes.position;

    // de-interleave the array
    position = deinterleaveAttribute(position);

    // cast from Int16 to Float32
    position = new THREE.BufferAttribute(
      new Float32Array(position.array),
      position.itemSize,
      false
    );

    // apply mesh scaling to the positions array
    position.applyMatrix4(obj.matrixWorld);

    positions.push(position);
  });
  return positions;
}
Result:

How can I append elements to a 3-dimensional array in Processing (v. 3.4)?

I am creating a program to render 3D graphics. I have a 3D array 'shapes' which contains all of the polygons to render. It is an array of polygons, where each polygon is itself an array of points, and each point is an array of 3 integer values (x, y, z co-ordinates). I have tried and failed to use the append() function. How else can I get it to work?
I've tried using the append() function, but this seems to not work with multidimensional arrays.
int[][][] addPolyhedron(int[][][] shapes, int[][][] polyhedron)
{
  for (int i = 0; i < polyhedron.length; i++)
  {
    shapes = append(shapes, polyhedron[i]);
  }
  return shapes;
}
I wanted this to extend the array shapes to include all of the polygons in the array polyhedron. However, I receive an error message saying 'type mismatch, "java.lang.Object" does not match with "int[][][]".' Thanks in advance.
In Java, arrays (of any dimension) are not extendable - the size is defined, allocated and fixed upon instantiation. You want to add to (and therefore dynamically resize) shapes. Although Processing does provide the append() function, I think it is more appropriate to use the ArrayList built-in Java data type.
Your function could be refactored into something like this:
ArrayList<Integer[][]> addPolyhedron(ArrayList<Integer[][]> shapes, ArrayList<Integer[][]> polyhedron)
{
  shapes.addAll(polyhedron);
  return shapes;
}
Note that int[][] has become Integer[][] because an ArrayList cannot be declared with primitive types (int, boolean, float, etc.).
Adding an individual program-defined polygon to shapes would be done like this:
shapes.add(new Integer[][] {{1,2,5},{3,4,5},{6,5,4}}); // adds a triangle to shapes.
Getting coordinates from the shapes ArrayList would be done like this:
shapes.get(0)[1][1]; // returns 4.

How to improve texture access performance in OpenGL shaders?

Conditions
I use OpenGL 3 and PyOpenGL.
I have ~50 thousand (53,490) vertices, and each of them has 199 vec3 attributes which determine their displacement. It's impossible to store this data as regular vertex attributes, so I use a texture.
The problem is: a non-parallelized C function calculates the displacement of the vertices as fast as GLSL, and even faster in some cases. I've checked: the issue is the texture reads, and I don't understand how to optimize them.
I've written two different shaders. One calculates the new model in ~0.09 s and the other in ~0.12 s (including attribute assignment, which is equal for both cases).
Code
Both shaders start with
#version 300 es
in vec3 vin_position;
out vec4 vin_pos;
uniform mat4 rotation_matrix;
uniform float coefficients[199];
uniform sampler2D principal_components;
The faster one is
void main(void) {
    int c_pos = gl_VertexID;
    int texture_size = 8192;
    ivec2 texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
    vec4 tmp = vec4(0.0);
    for (int i = 0; i < 199; i++) {
        tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
        c_pos += 53490;
        texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
    }
    gl_Position = rotation_matrix
        * vec4(vin_position + tmp.xyz, 246006.0);
    vin_pos = gl_Position;
}
The slower one
void main(void) {
    int texture_size = 8192;
    int columns = texture_size - texture_size % 199;
    int c_pos = gl_VertexID * 199;
    ivec2 texPos = ivec2(c_pos % columns, c_pos / columns);
    vec4 tmp = vec4(0.0);
    for (int i = 0; i < 199; i++) {
        tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
        texPos.x++;
    }
    gl_Position = rotation_matrix
        * vec4(vin_position + tmp.xyz, 246006.0);
    vin_pos = gl_Position;
}
The main difference between them:
in the first case attributes of vertices are stored in following way:
first attributes of all vertices
second attributes of all vertices
...
last attributes of all vertices
in the second case attributes of vertices are stored in another way:
all attributes of the first vertex
all attributes of the second vertex
...
all attributes of the last vertex
Also, in the second example the data is aligned so that all attributes of each vertex are stored in a single row. This means that if I know the row and column of the first attribute of some vertex, I only need to increment the x component of the texture coordinate.
I thought that aligned data would be accessed faster.
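Spelled out as index arithmetic, the two layouts map a (vertex, attribute) pair to a texel like this (a JavaScript sketch mirroring the two shaders above; the function names are illustrative):
// Layout 1 (attribute-major): attribute i of all vertices is stored contiguously.
function texelPosAttributeMajor(vertexId, attrIndex, numVertices, textureSize) {
  const linear = attrIndex * numVertices + vertexId;
  return { x: linear % textureSize, y: Math.floor(linear / textureSize) };
}

// Layout 2 (vertex-major): the 199 attributes of one vertex sit in a single row,
// with rows padded so a vertex never crosses a row boundary.
function texelPosVertexMajor(vertexId, attrIndex, numAttrs, textureSize) {
  const columns = textureSize - (textureSize % numAttrs); // usable columns per row
  const linear = vertexId * numAttrs;                     // first attribute of this vertex
  return { x: (linear % columns) + attrIndex, y: Math.floor(linear / columns) };
}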
Questions
Why isn't the data accessed faster?
How can I increase its performance?
Is there a way to link a texture chunk to a vertex?
Are there recommendations for data alignment, or good articles about caching on GPUs (Intel HD, NVIDIA GeForce)?
Notes
The coefficients array changes from frame to frame; otherwise there would be no problem: I could precalculate the model and be happy.
Why isn't the data accessed faster?
Because GPUs are not magical. GPUs gain performance by performing calculations in parallel. Performing roughly 10 million texel fetches (53,490 vertices × 199 fetches each), no matter how it happens, is not going to be fast.
If you were using the results of those textures to do lighting computations, it would appear fast because the cost of the lighting computation would be hidden by the latency of the memory fetches. You are taking the results of a fetch, doing a multiply/add, then doing another fetch. That's slow.
Is there a way to link a texture chunk to a vertex?
Even if there were (and there isn't), how would that help? GPUs execute operations in parallel. That means multiple vertices are being processed simultaneously, each performing ~200 texture fetches.
So what would aid performance there is making each texture access coherent. That is, neighboring vertices would access neighboring texels, thus making the texture fetches more cache efficient. But there's no way to know what vertices will be considered "neighbors". And texture swizzle layouts are implementation dependent, so even if you did know the order of vertex processing, you couldn't adjust your texture to take local advantage of it.
The best way to do that would be to ditch vertex shaders and texture accesses in favor of compute shaders and SSBOs. That way, you have direct knowledge of the locality of your accesses, by setting the work group size. With SSBOs, you can arrange your array in whatever fashion gives you the best locality of access for each wavefront.
But things like this are the equivalent of putting band-aids on a gaping wound.
How can I increase its performance?
Stop doing so many texture fetches.
I'm being completely serious. While there are ways to mitigate the costs of what you're doing, the most effective solution is to change your algorithm so that it doesn't need to do that much work.
Your algorithm looks suspiciously like vertex morphing via a palette of "poses", with the coefficient specifying the weight applied to each pose. If that's the case, then odds are good that most of your coefficients are either 0 or negligibly small. If so, then you're wasting vast amounts of time accessing textures only to transform their contributions into nothing.
If most of your coefficients are 0, then the best thing to do would be to pick some arbitrary and small number for the maximum number of coefficients that can affect the result. For example, 8. You send an array of 8 indices and coefficients to the shader as uniforms. Then you walk that array, fetching only 8 times. And you might be able to get away with just 4.
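The CPU-side selection described above might look like this (a sketch in JavaScript; the question uses PyOpenGL, but the selection logic is language-agnostic, and the names selectTopCoefficients, indices and weights are illustrative):
// Pick the K most significant coefficients and send only those (index + value)
// to the shader, so the vertex shader performs K texel fetches instead of 199.
function selectTopCoefficients(coefficients, K = 8) {
  const indexed = Array.from(coefficients, (value, index) => ({ index, value }));
  indexed.sort((a, b) => Math.abs(b.value) - Math.abs(a.value));
  const top = indexed.slice(0, K);
  return {
    indices: new Int32Array(top.map((c) => c.index)),  // upload as a uniform int array
    weights: new Float32Array(top.map((c) => c.value)) // upload as a uniform float array
  };
}

// The shader loop then walks only these K entries:
//   for (int i = 0; i < K; i++) {
//     ivec2 texPos = ...;  // same addressing as before, but using indices[i]
//     tmp += texelFetch(principal_components, texPos, 0) * weights[i];
//   }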

Nearest Neighbors in CUDA Particles

Edit 2: Please take a look at this crosspost for TLDR.
Edit: Given that the particles are segmented into grid cells (say a 16^3 grid), is it a better idea to run one work-group per grid cell, with as many work-items per work-group as the maximum number of particles a grid cell can hold?
In that case I could load all particles from neighboring cells into local memory and iterate through them, computing some properties. Then I could write a specific value into each particle in the current grid cell.
Would this approach be beneficial over running the kernel for all particles and, for each one, iterating over its (mostly the same) neighbors?
Also, what is the ideal ratio of number of particles/number of grid cells?
I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:
Buffer P holding all particles' 3D positions (float3)
Buffer Sp storing int2 pairs of particle ids and their spatial hashes. Sp is sorted according to the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)
Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp. Example: L[12] = (int2)(0, 50).
L[12].x is the index (in Sp) of the first particle with spatial hash 12.
L[12].y is the index (in Sp) of the last particle with spatial hash 12.
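For reference, building L from the sorted Sp pairs could look like this on the CPU (a sketch in JavaScript; the real buffers live on the GPU, and the field names are illustrative):
// Sp is an array of { id, hash } pairs already sorted by hash.
// Build L so that L[h] holds the first and last index in Sp whose hash is h.
function buildCellRanges(Sp, numCells) {
  const L = Array.from({ length: numCells }, () => ({ first: -1, last: -1 }));
  for (let i = 0; i < Sp.length; i++) {
    const h = Sp[i].hash;
    if (L[h].first === -1) L[h].first = i; // first particle with this hash
    L[h].last = i;                         // last particle with this hash
  }
  return L;
}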
Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):
__kernel void process_particles(__global float3* P, __global int2* Sp,
                                __global int2* L, __global int* Out) {
    size_t gid = get_global_id(0);
    float3 curr_particle = P[gid];
    int processed_value = 0;
    for (int x = -1; x <= 1; x++)
    for (int y = -1; y <= 1; y++)
    for (int z = -1; z <= 1; z++) {
        float3 neigh_position = curr_particle + (float3)(x, y, z) * GRID_CELL_SIDE;
        // ugly boundary checking
        if ( dot(neigh_position < 0, (float3)(1)) +
             dot(neigh_position > BOUNDARY, (float3)(1)) != 0 )
            continue;
        int neigh_hash = spatial_hash( neigh_position );
        int2 particles_range = L[ neigh_hash ];
        for (int p = particles_range.x; p < particles_range.y; p++)
            processed_value += heavy_computation( P[ Sp[p].y ] );
    }
    Out[gid] = processed_value;
}
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.
What I want to do is to use Z-order curve as the spatial hash. That way I could have only 1 for loop iterating through a continuous range of memory when querying neighbors. The only problem is that I don't know what should be the start and stop Z-index values.
The holy grail I want to achieve:
__kernel void process_particles(__global float3* P, __global int2* Sp,
                                __global int2* L, __global int* Out) {
    size_t gid = get_global_id(0);
    float3 curr_particle = P[gid];
    int processed_value = 0;

    // How to accomplish this??
    // `get_neighbors_range()` returns start and end Z-index values
    // representing the start and end near neighbors cells range
    int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
    int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
    int last_particle_id = L[ nearest_neighboring_cells_range.y ].y;

    for (int p = first_particle_id; p <= last_particle_id; p++) {
        processed_value += heavy_computation( P[ Sp[p].y ] );
    }
    Out[gid] = processed_value;
}
You should study the Morton code algorithms closely. Ericson's Real-Time Collision Detection explains this very well.
Ericson - Real-Time Collision Detection
Here is another nice explanation including some tests:
Morton encoding/decoding through bit interleaving: Implementations
The Z-order curve only defines the path through the coordinates, i.e. how to hash 2D or 3D coordinates into a single integer. Although the algorithm can go deeper with every iteration, you have to set the limits yourself. Usually the stop index is denoted by a sentinel; where the sentinel stops tells you at which level the particle is placed. So the maximum level you define determines the number of cells per dimension: for example, with a maximum level of 6 you have 2^6 = 64, i.e. 64x64x64 cells in your system (3D). That also means you have to use integer-based coordinates; if you use floats you have to convert, e.g. coord.x = 64 * float_x, and so on.
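A 3D Morton encoding by bit interleaving, as in the article linked above, might look like this (a JavaScript sketch for 10-bit coordinates; the function names are illustrative):
// Spread the lower 10 bits of v so there are two zero bits between each bit.
function expandBits(v) {
  v &= 0x3ff;
  v = (v | (v << 16)) & 0x030000ff;
  v = (v | (v << 8))  & 0x0300f00f;
  v = (v | (v << 4))  & 0x030c30c3;
  v = (v | (v << 2))  & 0x09249249;
  return v >>> 0;
}

// Interleave x, y, z (each 0..1023) into a single 30-bit Morton code.
function morton3D(x, y, z) {
  return (expandBits(x) | (expandBits(y) << 1) | (expandBits(z) << 2)) >>> 0;
}

// With a maximum level of 6, convert float coords in [0, 1) to integers first:
//   const cell = morton3D(Math.floor(64 * fx), Math.floor(64 * fy), Math.floor(64 * fz));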
If you know how many cells you have in your system you can define your limits. Are you trying to use a binary octree?
Since particles are in motion (in that CUDA example) you should try to parallelize over the number of particles instead of cells.
If you want to build lists of nearest neighbours you have to map the particles to cells. This is done through a table that is afterwards sorted by cell. Still, you should iterate through the particles and access their neighbours.
About your code:
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.
Remember Donald Knuth: premature optimization is the root of all evil. You should measure where the bottleneck is; you can use the NVIDIA profiler for that. Not sure what OpenCL has as a profiler.
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
continue;
I think you should not branch that way; how about returning zero from heavy_computation instead? Not sure, but maybe you have some sort of branch divergence here. Try to remove it somehow.
Running in parallel over the cells is a good idea only if you have no write accesses to the particle data; otherwise you will have to use atomics. If you go over the particle range instead, you have read accesses to the cells and neighbours, but you create your sum in parallel and are not forced into some race-condition handling.
Also, what is the ideal ratio of number of particles/number of grid cells?
It really depends on your algorithm and the particle packing within your domain, but in your case I would set the cell size equal to the particle diameter and just use the number of cells that gives you.
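In numbers, with the cell size tied to the particle diameter the grid resolution follows directly from the domain size (a tiny sketch with illustrative values):
// Cell size equal to the particle diameter.
const particleDiameter = 0.05;
const domainSize = 1.0;                                        // cubic domain, side length
const cellsPerAxis = Math.ceil(domainSize / particleDiameter); // 20
const numCells = cellsPerAxis ** 3;                            // 8000 cells for a 20^3 grid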
So if you want to use Z-order and achieve your holy grail, try to use integer coordinates and hash them.
Also try to use larger numbers of particles. Consider about 65,000 particles, like the CUDA example uses, because that way the parallelisation is most efficient: the available processing units are kept busy (fewer idle threads).

How do I rotate in object space in 3D (using matrices)

What I am trying to do is set up functions that can perform global-space and object-space rotations, but I am having problems understanding how to go about object-space rotations, as just multiplying a point by the rotation only works for global space. My idea was to build the rotation in object space, then multiply it by the inverse of the object's matrix, supposedly taking away all the excess rotation between object and global space, thus maintaining the object-space rotation but in global values. I was wrong in this logic, as it did not work. Here is my code, if you want to inspect it; all the functions it calls have been tested to work:
// build object space rotation
sf::Vector3<float> XMatrix (MultiplyByMatrix(sf::Vector3<float> (cosz,sinz,0)));
sf::Vector3<float> YMatrix (MultiplyByMatrix(sf::Vector3<float> (-sinz,cosz,0)));
sf::Vector3<float> ZMatrix (MultiplyByMatrix(sf::Vector3<float> (0,0,1)));
// build cofactor matrix
sf::Vector3<float> InverseMatrix[3];
CoFactor(InverseMatrix);
// multiply by the transpose of the cofactor matrix(the adjoint), to bring the rotation to world space coordinates
sf::Vector3<float> RelativeXMatrix = MultiplyByTranspose(XMatrix, InverseMatrix[0], InverseMatrix[1], InverseMatrix[2]);
sf::Vector3<float> RelativeYMatrix = MultiplyByTranspose(YMatrix, InverseMatrix[0], InverseMatrix[1], InverseMatrix[2]);
sf::Vector3<float> RelativeZMatrix = MultiplyByTranspose(ZMatrix, InverseMatrix[0], InverseMatrix[1], InverseMatrix[2]);
// perform the rotation from world space
PointsPlusMatrix(RelativeXMatrix, RelativeYMatrix, RelativeZMatrix);
The difference between rotation in world-space and object-space is where you apply the rotation matrix.
The usual way computer graphics uses matrices is to map vertex points:
from object-space, (multiply by MODEL matrix to transform)
into world-space, (then multiply by VIEW matrix to transform)
into camera-space, (then multiply by PROJECTION matrix to transform)
into projection-, or "clip"- space
Specifically, suppose points are represented as column vectors; then, you transform a point by left-multiplying it by a transformation matrix:
world_point = MODEL * model_point
camera_point = VIEW * world_point = (VIEW*MODEL) * model_point
clip_point = PROJECTION * camera_point = (PROJECTION*VIEW*MODEL) * model_point
Each of these transformation matrices may itself be the result of multiple matrices multiplied in sequence. In particular, the MODEL matrix is often composed of a sequence of rotations, translations, and scalings, based on a hierarchical articulated model, e.g.:
MODEL = STAGE_2_WORLD * BODY_2_STAGE *
SHOULDER_2_BODY * UPPERARM_2_SHOULDER *
FOREARM_2_UPPERARM * HAND_2_FOREARM
So, whether you are rotating in model-space or world-space depends on which side of the MODEL matrix you apply your rotation matrix. Of course, you can easily do both:
MODEL = WORLD_ROTATION * OLD_MODEL * OBJECT_ROTATION
In this case, WORLD_ROTATION rotates about the center of world-space, while OBJECT_ROTATION rotates about the center of object-space.
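As a concrete illustration using three.js's Matrix4 (the question itself uses SFML types, so this is only an analogy), the difference is just which side of the model matrix the rotation is applied on:
import * as THREE from "three";

const model = new THREE.Matrix4();                   // OLD_MODEL: the object's current transform
const spin = new THREE.Matrix4().makeRotationY(0.1); // some incremental rotation

// Object-space rotation: MODEL = OLD_MODEL * OBJECT_ROTATION
// (rotates about the object's own origin and axes)
model.multiply(spin);

// World-space rotation: MODEL = WORLD_ROTATION * OLD_MODEL
// (rotates about the world origin and world axes)
model.premultiply(spin);
Applying both calls in sequence yields exactly MODEL = WORLD_ROTATION * OLD_MODEL * OBJECT_ROTATION from above.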
