Optimizing Metal Compute - texture sampling using Gather? - gpgpu

I'm attempting to optimize a compute shader that calculates some values from texture samples and uses atomic operations to increment counters in a buffer, very similar to the following answer:
https://stackoverflow.com/a/68076730/5510818
kernel void compute( texture2d<half, access::read> inTexture [[texture(0)]],
                     volatile device atomic_uint *samples [[buffer(0)]],
                     ushort2 position [[thread_position_in_grid]] )
{
    // Early bail
    if ( position.x >= inTexture.get_width() || position.y >= inTexture.get_height() )
    {
        return;
    }
    half3 color = inTexture.read( position ).rgb;
    // do some math here
    // increment
    atomic_fetch_add_explicit( &( samples[offset] ), uint32_t( somevalue ), memory_order_relaxed );
}
And part of my encoder in Obj-C:
NSUInteger w = self.pass1PipelineState.threadExecutionWidth;
NSUInteger h = self.pass1PipelineState.maxTotalThreadsPerThreadgroup / w;
MTLSize threadsPerThreadGroup = MTLSizeMake(w, h, 1);
MTLSize threadsPerGrid = MTLSizeMake(frameMPSImage.width, frameMPSImage.height, 1);
[pass1Encoder dispatchThreads:threadsPerGrid threadsPerThreadgroup:threadsPerThreadGroup];
In an attempt to optimize, I am curious if I can leverage texture gather operations.
My understanding is that gather fetches the four texels 'around' the sample position, and that it does so in an optimal manner. Am I right in understanding that I could, in theory, optimize this by fetching via gather, doing 4x the compute in my kernel, and writing out 4x from a single thread?
I would have to ensure that the grid width and height I pass to the encoder don't duplicate work (i.e. divide by 4?).
Something like:
kernel void compute( texture2d<half, access::sample> inTexture [[texture(0)]],
                     volatile device atomic_uint *samples [[buffer(0)]],
                     ushort2 position [[thread_position_in_grid]] )
{
    // Early bail (with a reduced grid, the bounds would be the quad counts)
    if ( position.x >= inTexture.get_width() || position.y >= inTexture.get_height() )
    {
        return;
    }
    // gather needs a sampler and access::sample; each call returns one
    // component from the 2x2 footprint of texels around the coordinate
    constexpr sampler s( coord::normalized, filter::nearest );
    float2 texSize = float2( inTexture.get_width(), inTexture.get_height() );
    // aim at the corner shared by the 2x2 quad this thread would handle
    float2 uv = ( float2( position ) * 2.0 + 1.0 ) / texSize;
    half4 reds   = inTexture.gather( s, uv, int2( 0 ), component::x );
    half4 greens = inTexture.gather( s, uv, int2( 0 ), component::y );
    half4 blues  = inTexture.gather( s, uv, int2( 0 ), component::z );
    half3 color1 = half3( reds[0], greens[0], blues[0] );
    // do some math here
    half3 color2 = half3( reds[1], greens[1], blues[1] );
    // do some math here
    half3 color3 = half3( reds[2], greens[2], blues[2] );
    // do some math here
    half3 color4 = half3( reds[3], greens[3], blues[3] );
    // do some math here
    // increment 4x
    atomic_fetch_add_explicit( &( samples[offset1] ), uint32_t( somevalue1 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset2] ), uint32_t( somevalue2 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset3] ), uint32_t( somevalue3 ), memory_order_relaxed );
    atomic_fetch_add_explicit( &( samples[offset4] ), uint32_t( somevalue4 ), memory_order_relaxed );
}
Am I understanding gather correctly?
Are there any publicly available examples of gather? I cannot seem to find any!
Is there a way to take a mutex lock around the buffer so I am not doing four separate atomic operations in the code above?
Am I correct that I'd need to adjust my Obj-C encoder pass to account for each thread sampling 4x in the shader?
Thank you.

How to calculate the needed velocity vector to fire an arrow to hit a certain point?

I'm using the Oimo.js physics library with three.js.
I fire my arrow at a target, but my math doesn't seem to be right, and I'm having trouble remembering exactly how the kinematic formulas work.
I have an attack function which creates a projectile and fires it with a 3D vector, but it's not behaving how I thought it would, and I ended up needing to hard-code a y value, which doesn't really work either. Can someone point me in the correct direction? I also want the arrow to have a slight arc in its trajectory.
public attack( target: Unit, isPlayer: boolean ): void {
    let collisionGroup = isPlayer ? CollisionGroup.PLAYER_PROJECTILES : CollisionGroup.ENEMY_PROJECTILES;
    let collidesWithGroup = isPlayer ? CollidesWith.PLAYER_PROJECTILES : CollidesWith.ENEMY_PROJECTILES;
    this.model.lookAt( target.position );
    let direction: Vector3 = new Vector3( 0, 0, 0 );
    direction = this.model.getWorldDirection( direction );
    let value = this.calculateVelocity();
    let velocity = new Vector3( direction.x * value, Math.sin( _Math.degToRad( 30 ) ) * value, direction.z * value );
    let arrow = this.gameWorld.addProjectile( 'arrow3d', 'box', false, new Vector3( this.model.position.x, 5, this.model.position.z ), new Vector3( 0, 0, 0 ), false, collisionGroup, collidesWithGroup );
    arrow.scale = new Vector3( 10, 10, 5 );
    arrow.setVelocity( velocity );
    this.playAnimation( 'attack', false );
}

protected calculateVelocity(): number {
    return Math.sqrt( -2 * ( -9.8 / 60 ) * this.distanceToTarget );
}
I'm dividing by 60 because of the Oimo.js timestep.

three.js: Get updated vertices with skeletal animations?

Similar to the Stack Overflow question Three.js: Get updated vertices with morph targets, I am interested in how to get the "actual" position of the vertices of a mesh with a skeletal animation.
I have tried printing out the position values, but they are never actually updated (as I understand it, this is because they are calculated on the GPU, not the CPU). The answer to the question above suggests doing the same computations on the CPU as on the GPU to get up-to-date vertex positions for morph-target animations, but is there a way to take the same approach for skeletal animations? If so, how?
Also, for the morph targets, someone pointed out that this code is already present in the Mesh.raycast function (https://github.com/mrdoob/three.js/blob/master/src/objects/Mesh.js#L115). However, I don't see how the raycast works with skeletal-animation meshes: how does it know the updated positions of the faces?
Thank you!
A similar topic was discussed in the three.js forum some time ago. I presented a fiddle there which computes the AABB for a skinned mesh each frame. The code performs the same vertex displacement in JavaScript that the vertex shader does. The routine looks like this:
// scratch objects, allocated once and reused per vertex
var vertex = new THREE.Vector3();
var temp = new THREE.Vector3();
var skinned = new THREE.Vector3();
var skinIndices = new THREE.Vector4();
var skinWeights = new THREE.Vector4();
var boneMatrix = new THREE.Matrix4();

function updateAABB( skinnedMesh, aabb ) {
    var skeleton = skinnedMesh.skeleton;
    var boneMatrices = skeleton.boneMatrices;
    var geometry = skinnedMesh.geometry;
    var index = geometry.index;
    var position = geometry.attributes.position;
    var skinIndex = geometry.attributes.skinIndex;
    var skinWeight = geometry.attributes.skinWeight;
    var bindMatrix = skinnedMesh.bindMatrix;
    var bindMatrixInverse = skinnedMesh.bindMatrixInverse;
    var i, j, si, sw;
    aabb.makeEmpty();
    //
    if ( index !== null ) {
        // indexed geometry
        for ( i = 0; i < index.count; i ++ ) {
            vertex.fromBufferAttribute( position, index.getX( i ) );
            skinIndices.fromBufferAttribute( skinIndex, index.getX( i ) );
            skinWeights.fromBufferAttribute( skinWeight, index.getX( i ) );
            // the following code section is normally implemented in the vertex shader
            vertex.applyMatrix4( bindMatrix ); // transform to bind space
            skinned.set( 0, 0, 0 );
            for ( j = 0; j < 4; j ++ ) {
                si = skinIndices.getComponent( j );
                sw = skinWeights.getComponent( j );
                boneMatrix.fromArray( boneMatrices, si * 16 );
                // weighted vertex transformation
                temp.copy( vertex ).applyMatrix4( boneMatrix ).multiplyScalar( sw );
                skinned.add( temp );
            }
            skinned.applyMatrix4( bindMatrixInverse ); // back to local space
            // expand aabb
            aabb.expandByPoint( skinned );
        }
    } else {
        // non-indexed geometry
        for ( i = 0; i < position.count; i ++ ) {
            vertex.fromBufferAttribute( position, i );
            skinIndices.fromBufferAttribute( skinIndex, i );
            skinWeights.fromBufferAttribute( skinWeight, i );
            // the following code section is normally implemented in the vertex shader
            vertex.applyMatrix4( bindMatrix ); // transform to bind space
            skinned.set( 0, 0, 0 );
            for ( j = 0; j < 4; j ++ ) {
                si = skinIndices.getComponent( j );
                sw = skinWeights.getComponent( j );
                boneMatrix.fromArray( boneMatrices, si * 16 );
                // weighted vertex transformation
                temp.copy( vertex ).applyMatrix4( boneMatrix ).multiplyScalar( sw );
                skinned.add( temp );
            }
            skinned.applyMatrix4( bindMatrixInverse ); // back to local space
            // expand aabb
            aabb.expandByPoint( skinned );
        }
    }
    aabb.applyMatrix4( skinnedMesh.matrixWorld );
}
Also, for the morph targets, someone pointed out that this code is already present in the Mesh.raycast function
Yes, you can raycast against morphed meshes. Raycasting against skinned meshes is not supported yet. The code in Mesh.raycast() is already very complex. I think it needs some serious refactoring before it is further enhanced. In the meantime, you can use the presented code snippet to build a solution by yourself. The vertex displacement logic is actually the most complicated part.
Live demo: https://jsfiddle.net/fnjkeg9x/1/
three.js R107

Why does this wobble?

Tested on Processing 2.2.1 & 3.0a2 on OS X.
The code I've tweaked below may look familiar to some of you; it's what Imgur now uses as their loading animation. It was posted on OpenProcessing.org and I've been able to get it working in Processing, but the arcs are constantly wobbling around (relative movement within 1 pixel). I'm new to Processing and I don't see anything in the sketch that could be causing this; it runs in ProcessingJS without issue (though with very high CPU utilization).
int num = 6;
float step, spacing, theta, angle, startPosition;

void setup() {
    frameRate( 60 );
    size( 60, 60 );
    strokeWeight( 3 );
    noFill();
    stroke( 51, 51, 51 );
    step = 11;
    startPosition = -( PI / 2 );
}

void draw() {
    background( 255, 255, 255, 0 );
    translate( width / 2, height / 2 );
    for ( int i = 0; i < num; i++ ) {
        spacing = i * step;
        angle = ( theta + ( ( PI / 4 / num ) * i ) ) % PI;
        float arcEnd = map( sin( angle ), -1, 1, -TWO_PI, TWO_PI );
        if ( angle <= ( PI / 2 ) ) {
            arc( 0, 0, spacing, spacing, 0 + startPosition, arcEnd + startPosition );
        }
        else {
            arc( 0, 0, spacing, spacing, TWO_PI - arcEnd + startPosition, TWO_PI + startPosition );
        }
    }
    arc( 0, 0, 1, 1, 0, TWO_PI );
    theta += .02;
}
If it helps, I'm trying to export this to an animated GIF. I tried doing this with ProcessingJS and jsgif, but hit some snags. I'm able to get it exported in Processing using gifAnimation just fine.
UPDATE
Looks like I'm going with hint( ENABLE_STROKE_PURE ), cleaned up with strokeCap( SQUARE ), both in setup(). It doesn't look the same as the original, but I do like the straight edges. Sometimes when you compromise, the result ends up even better than the "ideal" solution.
I see the problem on 2.2.1 for OS X, and calling hint(ENABLE_STROKE_PURE) in setup() fixes it for me. I couldn't find good documentation for this call, though; it's just something that gets mentioned here and there.
As for the root cause, if I absolutely had to speculate, I'd guess that Processing's Java renderer approximates a circular arc using a spline with a small number of control points. The control points are spaced out between the endpoints, so as the endpoints move, so do the bumps in the approximation. The approximation might be good enough for a single frame, but the animation makes the bumps obvious. Setting ENABLE_STROKE_PURE might increase the number of control points, or it might force Processing to use a more expensive circular arc primitive in the underlying graphics library it's built upon. Again, though, this is just a guess as to why a drawing environment might have a bug like the one you've seen. I haven't read Processing's source code to verify the guess.

Why is my GLFW window so slow?

For some reason the following code produces a result such as this:
It takes a second or two to render each frame, and I have no idea why; the code looks normal to me.
#define GLFW_INCLUDE_GLU
#define GLEW_STATIC
#include "GL/glew.c"
#include "GLFW/glfw3.h"
#include <cmath>
#include <ctime>
#include <stdlib.h>

using namespace std;

int main( ) {
    // init glfw
    if ( !glfwInit( ) ) return 0;
    // create window and set active context
    GLFWwindow* window = glfwCreateWindow( 640, 480, "Test", NULL, NULL );
    glfwMakeContextCurrent( window );
    // init glew
    if ( glewInit( ) != GLEW_OK ) return 0;
    // set swap interval
    glfwSwapInterval( 1 );
    // render loop
    while ( !glfwWindowShouldClose( window ) ) {
        srand( time( NULL ) );
        float r = ( rand( ) % 100 ) / 100.0;
        float ratio;
        int width, height;
        glfwGetFramebufferSize( window, &width, &height );
        ratio = width / ( float ) height;
        glViewport( 0, 0, width, height );
        // render
        glClear( GL_COLOR_BUFFER_BIT );
        glClearColor( r, 0, 0, 1 );
        // swap
        glfwSwapBuffers( window );
        // events
        glfwPollEvents( );
    }
    // terminate glfw
    glfwTerminate( );
    return 0;
}
Your GLFW code is correct, and it is performing much faster than you think.
The problem is incorrect usage of rand and srand. Concretely, you call srand with the current time (measured in seconds) every time you render, which, within the same second, produces exactly the same value for r every time.
You therefore clear the screen to the same color several hundred times per second. This looks like it only renders at one frame per second, but it really isn't.
There are a few other problems with your code (such as rand() % 100 giving a biased distribution, and some redundant OpenGL calls), but the one thing you need to fix for your immediate problem is: call srand only once, not every time you render.

Core Animation / GLES2.0: Getting bitmap from CALayer into GL Texture

I am rewriting some Core Animation code in GL because it wasn't performing fast enough.
In my previous version, each button was represented by a CALayer, containing sublayers for the overall shape and the text content.
What I would like to do is set shouldRasterize = YES on this layer, force it to render onto its own internal bitmap, then send that over to my GL code, which is currently:
// Create a 512x512 greyscale texture
{
    // MUST be power of 2 for W & H or FAILS!
    GLuint W = 512, H = 512;
    printf( "Generating texture: [%d, %d]\n", W, H );
    // Create a pretty greyscale pixel pattern
    GLubyte *P = calloc( 1, ( W * H * 4 * sizeof( GLubyte ) ) );
    for ( GLuint i = 0; ( i < H ); ++i )
    {
        for ( GLuint j = 0; ( j < W ); ++j )
        {
            P[( ( i * W + j ) * 4 + 0 )] =
            P[( ( i * W + j ) * 4 + 1 )] =
            P[( ( i * W + j ) * 4 + 2 )] =
            P[( ( i * W + j ) * 4 + 3 )] = ( i ^ j );
        }
    }
    // Ask GL to give us a texture-ID for us to use
    glGenTextures( 1, &greyscaleTexture );
    // make it the ACTIVE texture, ie functions like glTexImage2D will
    // automatically know to use THIS texture
    glBindTexture( GL_TEXTURE_2D, greyscaleTexture );
    // set some params on the ACTIVE texture
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );
    glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );
    // WRITE/COPY from P into active texture
    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA, W, H, 0, GL_RGBA, GL_UNSIGNED_BYTE, P );
    free( P );
    glLogAndFlushErrors();
}
Could someone help me patch this together?
EDIT: I actually want to create a black and white mask, so every pixel would be either 0x00 or 0xFF. Then I can make a bunch of quads, and for each quad set all of its vertices to a particular colour. Hence I can easily get differently coloured buttons from the same stencil...
http://iphone-3d-programming.labs.oreilly.com/ch05.html#GeneratingTexturesWithQuartz
GLSprite example here,
http://developer.apple.com/library/ios/navigation/#section=Frameworks&topic=OpenGLES
