Gradient of a function in OpenCL - algorithm

I'm playing around a bit with OpenCL and I have a problem which can be simplified as follows.
I'm sure this is a common problem, but I cannot find many references or examples that show how this is usually done.
Suppose, for example, you have a function (written in C-style syntax):
float function(float x1, float x2, float x3, float x4, float x5)
{
return sin(x1) + x1*cos(x2) + x3*exp(-x3) + x4 + x5;
}
I can also implement the gradient of this function as:
void functionGradient(float x1, float x2, float x3, float x4, float x5, float gradient[])
{
gradient[0] = cos(x1) + cos(x2);
gradient[1] = -x1*sin(x2);
gradient[2] = exp(-x3) - x3*exp(-x3);
gradient[3] = 1.0f;
gradient[4] = 1.0f;
}
Now I was thinking of implementing an OpenCL C kernel function that does the same thing, because I want to speed this up. The only way I have in mind is to assign each work-item one component of the gradient, but then I'd need a bunch of if statements in the code to figure out which work-item computes which component, which is bad in general because of divergence.
Therefore, here is the question: how is such a problem tackled in general? I'm aware, for example, of gradient descent implementations on GPUs; see machine learning with backpropagation, for example. So I wonder what is generally done to avoid divergence in the code.
Follow-up from a suggestion
I'm thinking of a possible SIMD-compatible implementation as follows:
/*
Pseudo OpenCL-C code
here weight is a 5x5 array containing weights in {0,1} masking the relevant
computation
*/
__kernel void functionGradient(float x1, float x2, float x3, float x4, float x5, __global float* weight, __global float* gradient)
{
size_t threadId = get_global_id(0);
gradient[threadId] =
weight[5*threadId]*(cos(x1) + cos(x2)) +
weight[5*threadId + 1]*(-x1*sin(x2)) +
weight[5*threadId + 2]*(exp(-x3) - x3*exp(-x3)) +
weight[5*threadId + 3] + weight[5*threadId + 4];
barrier(CLK_GLOBAL_MEM_FENCE);
}

If your gradient function only has 5 components, it does not make sense to parallelize it such that one thread computes one component. As you mentioned, GPU parallelization does not work well if the mathematical structure of each component is different (multiple instructions multiple data, MIMD).
If you needed to compute the 5-dimensional gradient at 100k different coordinates, however, then each thread would compute all 5 components for one coordinate, and parallelization would work efficiently.
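For illustration, here is a minimal sketch of that layout in OpenCL C (the kernel name and the packed buffer layout are assumptions for the example, not from the question); every thread runs identical code on its own coordinate tuple, so there is no divergence:
// One thread per coordinate tuple: thread gid reads its 5 inputs and
// writes all 5 gradient components. Identical work per thread -> SIMD.
kernel void gradientAtPoints(const global float* x, global float* gradient) {
const uint gid = get_global_id(0);
const float x1 = x[5*gid], x2 = x[5*gid+1], x3 = x[5*gid+2];
gradient[5*gid+0] = cos(x1) + cos(x2);
gradient[5*gid+1] = -x1*sin(x2);
gradient[5*gid+2] = exp(-x3) - x3*exp(-x3);
gradient[5*gid+3] = 1.0f;
gradient[5*gid+4] = 1.0f;
}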
In the backpropagation example, you have one gradient function with thousands of dimensions. In this case you would indeed parallelize the gradient function itself, such that one thread computes one component of the gradient. However, in this case all gradient components have the same mathematical structure (with different weighting factors in global memory), so branching is not required. Each gradient component is the same equation with different numbers (single instruction multiple data, SIMD). GPUs are designed to handle only SIMD; this is also why they are so energy-efficient (~30 TFLOPs @ 300 W) compared to CPUs (which can do MIMD, ~2-3 TFLOPs @ 150 W).
Finally, note that backpropagation / neural nets are specifically designed to be SIMD. Not every new algorithm you come across can be parallelized in this manner.
Coming back to your 5-dimensional gradient example: there are ways to make it SIMD-compatible without branching, specifically bit-masking. You would compute 2 cosines (for component 1, express the sine through a cosine) and one exponential, and add all the terms up with a factor in front of each. The terms that you don't need, you multiply by a factor of 0. The factors are functions of the component ID. However, as mentioned above, this only makes sense if you have many thousands to millions of dimensions.
Edit: here is the SIMD-compatible version with bit masking:
kernel void functionGradient(const float x1, const float x2, const float x3, const float x4, const float x5, global float* gradient) {
const uint gid = get_global_id(0);
const float cosx1 = cos(x1);
const float cosx2 = cos(x2+(gid==1)*1.5707964f); // for gid==1 this is cos(x2+pi/2) = -sin(x2)
const float expmx3 = exp(-x3);
gradient[gid] = (gid==0)*cosx1 + ((gid==0)+(gid==1)*x1)*cosx2 + (gid==2)*(expmx3-x3*expmx3) + (gid>=3);
}
Note that there is no additional global/local memory access, and all the (mutually exclusive) weighting factors are functions of the global ID. Each thread computes exactly the same thing (2 cosines, 1 exponential and a few multiplications/additions) without any branching. Trigonometric functions / divisions take much more time than multiplications/additions, so as few as possible should be used, by pre-calculating terms.

Related

Dot product vs Direct vector components sum performance in shaders

I'm writing CG shaders for advanced lighting calculation for a Unity-based game. Sometimes I need to sum all of a vector's components. There are two ways to do it:
Just write something like:
float sum = v.x + v.y + v.z;
Or do something like:
float sum = dot(v,float3(1,1,1));
I am really curious which is faster, and which looks better code-style-wise.
It's obvious that if we had the same question for CPU calculations, the first simple way would be much better. Because:
a) There is no need to allocate another float3(1,1,1) vector
b) There is no need to multiply every original vector "v" component by 1.
But since we do it in shader code, which runs on the GPU, I believe there is some great hardware optimization for the dot product function, and maybe the allocation of float3(1,1,1) will be translated into no allocation at all.
float4 _someVector;
void surf (Input IN, inout SurfaceOutputStandard o){
float sum = _someVector.x + _someVector.y + _someVector.z + _someVector.w;
// VS
float sum2 = dot(_someVector, float4(1,1,1,1));
}
Check this link.
Vec3 Dot has a cost of 3 cycles, while Scalar Add has a cost of 1.
Thus, in almost all platforms (AMD and NVIDIA):
float sum = v.x + v.y + v.z; has a cost of 2
float sum = dot(v,float3(1,1,1)); has a cost of 3
The first implementation should be faster.
Implementation of the Dot product in cg: https://developer.download.nvidia.com/cg/dot.html
IMHO the difference is immeasurable in 98% of cases, but the first one should be faster, because multiplication is a "more expensive" operation.

Convert a bivariate draw in a univariate draw in Matlab

I have in mind the following experiment to run in Matlab, and I am asking for help implementing step (3). Any suggestions would be much appreciated.
(1) Consider the random variables X and Y both uniformly distributed on [0,1]
(2) Draw N realisation from the joint distribution of X and Y assuming that X and Y are independent (meaning that X and Y are uniformly jointly distributed on [0,1]x[0,1]). Each draw will be in [0,1]x[0,1].
(3) Transform each draw in [0,1]x[0,1] into a draw in [0,1] using the Hilbert space-filling curve: under the Hilbert curve mapping, the draw in [0,1]x[0,1] should be the image of one (or more, because of surjectivity) point(s) in [0,1]. I want to pick one of these points. Is there any pre-built package in Matlab doing this?
I found this answer, which I don't think does what I want, as it explains how to obtain the Hilbert value of the draw (curve length from the start of the curve to the picked point).
On Wikipedia I found this code in the C language (from (x,y) to d) which, again, does not answer my question.
EDIT: This answer does not address the updated version of the question, which explicitly asks about constructing the Hilbert curve. Instead, it addresses a related question on constructing a bijective mapping, and its relation to the uniform distribution.
Your problem is not really well defined. If you only need the resulting distribution to be uniform, nothing is stopping you from simply picking f:(X,Y)->X. The result would be uniform regardless of whether X and Y are correlated. From your post I can only presume that what you want, in fact, is for the resulting transformation to be bijective, or as close to it as possible given machine precision limitations.
It's worth noting that unless you need the algorithm that is best at preserving locality (which is clearly not required for the resulting mapping to be bijective, not to mention uniform), there's no need to bother constructing the Hilbert curves you mention in your question. They have just as much to do with the solution as any other space-filling curve, and they are incredibly computationally intensive.
So assuming you're looking for a bijective mapping, your question is equivalent to asking whether the set of points in a [unit] square has the same cardinality as the set of points in a [unit] line segment, and if so, how to construct that bijection, i.e. a 1-to-1 correspondence. Intuition says the square should have a higher cardinality, and Cantor spent 3 years trying to prove exactly that, eventually proving quite the opposite: these sets are in fact equinumerous. He was so surprised at his discovery that he wrote:
I see it, but I don't believe it!
The most commonly referred-to bijection fulfilling** this criterion is the following. Represent x and y in their decimal form, i.e. x = 0.x1x2x3x4x5... and y = 0.y1y2y3y4y5..., and let f:(X,Y)->Z be z = 0.x1y1x2y2x3y3x4y4x5y5..., i.e. alternating the decimals of the two numbers. The idea behind the bijection is trivial, though a rigorous proof requires quite a bit of prior knowledge.
** The caveat is that if we take e.g. x = 1/3 = 0.33333... and y = 1/5 = 0.199999... = 0.200000..., we can see there are two sequences corresponding to them: z = 0.313939393939... and z = 0.323030303030.... To overcome this obstacle we have to prove that adding a countable set to an uncountable one does not change the cardinality of the latter.
In reality we have to deal with machine precision and not pure math, which strictly speaking means both sets are actually finite and hence not equinumerous (assuming you store the result with the same precision as the original numbers). This means we're forced to make some assumptions and lose some information, such as, in this case, the last half of the significant digits of x and y. That is, unless we use a different data type that allows storing the result with double the precision of the original variables.
Finally, sample implementation in Matlab:
x = rand();
y = rand();
% each row is a 19-character string: '0.' followed by 17 digits
chars = [num2str(x, '%.17f'); num2str(y, '%.17f')];
% drop the '0.' prefixes and interleave the two digit rows column by column
z = str2double(['0.' reshape(chars(:,3:end), 1, [])]);
>> cellstr(['x=' num2str(x, '%.17f'); 'y=' num2str(y, '%.17f'); 'z=' num2str(z, '%.17f')])
ans =
'x=0.65549803980353738'
'y=0.10975505072305158'
'z=0.61505947958500362'
Edit: This answers the original request for a transformation f(x,y) -> t ~ U[0,1] given x,y ~ U[0,1], and additionally for x and y correlated. The updated question asks specifically for a Hilbert curve, H(x,y) -> t ~ U[0,1], and only for x,y ~ U[0,1], so this answer is no longer relevant.
Consider a random uniform sequence in [0,1]: r1, r2, r3, .... You are assigning this sequence to pairs of numbers (x1,y1), (x2,y2), .... What you are asking for is a transformation on pairs (x,y) which yields a uniform random number in [0,1].
Consider the random subsequence r1, r3, ... corresponding to x1, x2, .... If you trust that your number generator is random and uncorrelated in [0,1], then the subsequence x1, x2, ... should also be random and uncorrelated in [0,1]. So the rather simple answer to the first part of your question is a projection onto either the x or y axis. That is, just pick x.
Next consider correlations between x and y. Since you haven't specified the nature of the correlation, let's assume a simple scaling of the axes,
such as x' => [0, 0.5], y' => [0, 3.0], followed by a rotation. The scaling doesn't introduce any correlation since x' and y' are still independent. You can generate it easily enough with a matrix multiply:
M1*p = [x_scale, 0; 0, y_scale] * [x; y]
for matrix M1 and point p. You can introduce a correlation by taking this stretched form and rotating it by theta:
M2*M1*p = [cos(theta), sin(theta); -sin(theta), cos(theta)]*M1*p
Putting it all together with theta = pi/4, or 45 degrees, you can see that larger values of y are correlated with larger values of x:
cos_t = sin_t = cos(pi/4); % at 45 degrees, sin(t) = cos(t) = 1/sqrt(2)
M2 = [cos_t, sin_t; -sin_t, cos_t];
M1 = [0.5, 0.0; 0.0, 3.0];
p = rand(2,1000);
p_prime = M2*M1*p;
plot(p_prime(1,:), p_prime(2,:), '.');
axis('equal');
The resulting plot* shows a band of uniformly distributed numbers at a 45 degree angle.
Further transformations are possible with shear and, if you are clever about it, translation (OpenGL uses 4x4 transformation matrices so that translation can be represented as a linear transform: an extra dimension is added before the transformation steps and removed when they are done).
Given a known affine correlation structure, you can transform back from random points (x',y') to points (x,y) where x and y are independent in [0,1] by solving Mk*...*M1 p = p_prime for p, or equivalently, by setting p = inv(Mk*...*M1) * p_prime, where p=[x;y]. Again, just pick x, which will be uniform in [0,1]. This doesn't work if the transformation matrix is singular, e.g., if you introduce a projection matrix Mj into the mix (though if the projection is the first step you can still recover).
* You may notice that the plot is from python rather than matlab. I don't have matlab or octave sitting in front of me right now, so I hope I got the syntax details right.
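A minimal Matlab sketch of that recovery step, reusing M1, M2 and p_prime from the snippet above:
p = (M2*M1) \ p_prime; % backslash solve, equivalent to inv(M2*M1)*p_prime
x = p(1,:);            % just pick x: uniform in [0,1] again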
You could compute the Hilbert curve from f(x,y)=z. Basically it's a Hamiltonian path traversal. You can find a good description at Nick's spatial index Hilbert curve quadtree blog. Or take a look at monotonic n-ary Gray code. I've written an implementation based on Nick's blog in PHP: http://monstercurves.codeplex.com.
I will focus only on your last point
(3) Transform each draw in [0,1]x[0,1] into a draw in [0,1] using the Hilbert space-filling curve: under the Hilbert curve mapping, the draw in [0,1]x[0,1] should be the image of one (or more, because of surjectivity) point(s) in [0,1]. I want to pick one of these points. Is there any pre-built package in Matlab doing this?
As far as I know, there aren't pre-built packages in Matlab doing this, but the good news is that the code on Wikipedia can be called from MATLAB, and it is as simple as putting together the conversion routine with a gateway function in an xy2d.c file:
#include "mex.h"
// source: https://en.wikipedia.org/wiki/Hilbert_curve
// rotate/flip a quadrant appropriately
void rot(int n, int *x, int *y, int rx, int ry) {
if (ry == 0) {
if (rx == 1) {
*x = n-1 - *x;
*y = n-1 - *y;
}
//Swap x and y
int t = *x;
*x = *y;
*y = t;
}
}
// convert (x,y) to d
int xy2d (int n, int x, int y) {
int rx, ry, s, d=0;
for (s=n/2; s>0; s/=2) {
rx = (x & s) > 0;
ry = (y & s) > 0;
d += s * s * ((3 * rx) ^ ry);
rot(s, &x, &y, rx, ry);
}
return d;
}
/* The gateway function */
void mexFunction( int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
int n; /* input scalar */
int x; /* input scalar */
int y; /* input scalar */
/* check for proper number of arguments */
if(nrhs!=3) {
mexErrMsgIdAndTxt("MyToolbox:arrayProduct:nrhs","Three inputs required.");
}
if(nlhs!=1) {
mexErrMsgIdAndTxt("MyToolbox:arrayProduct:nlhs","One output required.");
}
/* get the value of the scalar inputs */
n = mxGetScalar(prhs[0]);
x = mxGetScalar(prhs[1]);
y = mxGetScalar(prhs[2]);
/* create the output */
plhs[0] = mxCreateDoubleScalar(xy2d(n,x,y));
}
and compile it with mex('xy2d.c').
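After compiling, a quick sanity check (the expected value follows from tracing the Wikipedia routine on a 4-by-4 grid, where cell (1,2) is the 8th cell the curve visits):
>> xy2d(4, 1, 2)
ans =
7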
The above implementation
[...] assumes a square divided into n by n cells, for n a power of 2, with integer coordinates, with (0,0) in the lower left corner, (n-1,n-1) in the upper right corner.
In practice, a discretization step is required before applying the mapping. As in every discretization problem, it is crucial to choose the precision wisely. The snippet below puts everything together.
close all; clear; clc;
% number of random samples
NSAMPL = 100;
% unit square divided into n-by-n cells
% has to be a power of 2
n = 2^2;
% quantum
d = 1/n;
N = 0:d:1;
% generate random samples
x = rand(1,NSAMPL);
y = rand(1,NSAMPL);
% discretization
bX = floor(x/d);
bY = floor(y/d);
% 2d to 1d mapping
dd = zeros(1,NSAMPL);
for iid = 1:length(dd)
dd(iid) = xy2d(n, bX(iid), bY(iid));
end
figure;
hold on;
axis equal;
plot(x, y, '.');
plot(repmat([0;1], 1, length(N)), repmat(N, 2, 1), '-r');
plot(repmat(N, 2, 1), repmat([0;1], 1, length(N)), '-r');
figure;
plot(1:NSAMPL, dd);
xlabel('# of sample')

Need to make an efficient vector handling algorithm for gravity simulation

So I'm currently working on a Java Processing program where I want to simulate high numbers of particles interacting with collision and gravity. This obviously causes some performance issues when the particle count gets high, so I try my best to optimize and avoid expensive operations such as square root, which is otherwise used to find the distance between two points.
However, now I'm wondering how I could write the algorithm that figures out the direction a particle should move in, given it only knows the squared distance and the difference between the particles' x and y coordinates (dx, dy).
Here's a snippet of the code (yes, I know I should use vectors instead of separate x/y pairs. Yes, I know I should eventually handle particles with grids and clusters for further optimization). Anyways:
void applyParticleGravity(){
int limit = 2*particleRadius+1; //Gravity no longer applied if particles are within collision reach of each other.
float ax, ay, bx, by, dx, dy;
float distanceSquared, f;
float gpp = GPP; //Constant is used, since simulation currently assumes all particles have equal mass: GPP = Gravity constant * Particle Mass * Particle Mass
Vector direction = new Vector();
Particle a, b;
int nParticles = particles.size(); //"particles" is an ArrayList of particle objects, each storing an x/y coordinate and velocity.
for (int i=0; i<nParticles-1; i++){
a = particles.get(i);
ax = a.x;
ay = a.y;
for (int j=i+1; j<nParticles; j++){
b = particles.get(j);
bx = b.x;
by = b.y;
dx = ax-bx;
dy = ay-by;
if (Math.abs(dx) > limit || Math.abs(dy) > limit){ //Not too close to each other
distanceSquared = dx*dx + dy*dy; //Avoiding square roots
f = gpp/distanceSquared; //Gravity formula: Force = G*(m1*m2)/d^2
//Perform some trigonometric magic to decide direction.x and direction.y as a number between -1 and 1.
a.fx += f*direction.x; //Adds force to particle. At end of main iteration, x-position is increased by fx/mass and so forth.
a.fy += f*direction.y;
b.fx -= f*direction.x; //Apply inverse force to other particle (Newton's 3rd law)
b.fy -= f*direction.y;
}
}
}
}
Is there a more accurate way of deciding the x and y pull strength with some trigonometric magic, without killing performance when there are several hundred particles? Something I thought about was doing some sort of (int)dx/dy with the % operator or so, and getting an index into a pre-calculated array of values.
Anyone have a clue? Thanks!
hehe, I think we're working on the same kind of thing, except I'm using HTML5 canvas. I came across this trying to figure out the same thing. I didn't find anything, but I figured out what I was going for, and I think it will work for you too.
You want a unit vector that points from one particle to the other. Its length will be 1, and x and y will be between -1 and 1. Then you take this unit vector and multiply it by your force scalar, which you're already calculating.
To "point at" one particle from another, without using square root, first get the heading (in radians) from particle1 to particle2:
heading = Math.atan2(dy, dx)
Note that y comes first: atan2(y, x) is the argument order in both Java and JavaScript.
Get the x and y components of this heading using sin/cos:
direction.x = Math.cos(heading)
direction.y = Math.sin(heading)
You can see an example here:
https://github.com/nijotz/triforces/blob/c7b85d06cf8a65713d9b84ae314d5a4a015876df/src/cljs/triforces/core.cljs#L41
It's Clojurescript, but it may help.
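If the atan2/sin/cos calls themselves become the bottleneck, one square root does the same job. A minimal Java sketch under that idea (the helper name is made up; the Particle fields and variables are the ones from the question):
void applyGravityPair(Particle a, Particle b, float dx, float dy, float distanceSquared, float f){
//One sqrt instead of atan2 + sin + cos; dx = a.x - b.x and dy = a.y - b.y as in the question.
float invDist = 1.0f / (float) Math.sqrt(distanceSquared);
float dirX = -dx*invDist; //Unit vector component pointing from a toward b
float dirY = -dy*invDist;
a.fx += f*dirX;
a.fy += f*dirY;
b.fx -= f*dirX; //Newton's 3rd law
b.fy -= f*dirY;
}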

Unprojecting Screen coords to world in OpenGL es 2.0

Long time listener, first time caller.
So I have been playing around with the Android NDK and I'm at a point where I want to unproject a tap to world coordinates, but I can't make it work.
The problem is that the x and y values for both the near and far points are the same, which doesn't seem right for a perspective projection. Everything in the scene draws OK, so I'm a bit confused why it wouldn't unproject properly. Anyway, here is my code:
//x and y are the normalized screen coords
ndk_helper::Vec4 nearPoint = ndk_helper::Vec4(x, y, 1.f, 1.f);
ndk_helper::Vec4 farPoint = ndk_helper::Vec4(x, y, 1000.f, 1.f);
ndk_helper::Mat4 inverseProjView = this->matProjection * this->matView;
inverseProjView = inverseProjView.Inverse();
nearPoint = inverseProjView * nearPoint;
farPoint = inverseProjView * farPoint;
nearPoint = nearPoint *(1 / nearPoint.w_);
farPoint = farPoint *(1 / farPoint.w_);
Well, after looking at the vector/matrix math code in ndk_helper, this isn't a surprise. In short: don't use it. After scanning through it for a couple of minutes, it has some obvious mistakes that look like simple typos. And particularly the Vec4 class is mostly useless for the kind of vector operations you need for graphics. Most of the operations assume that a Vec4 is a vector in 4D space, not a vector containing homogeneous coordinates in 3D space.
If you want, you can check it out here, but be prepared for a few face palms:
https://android.googlesource.com/platform/development/+/master/ndk/sources/android/ndk_helper/vecmath.h
For example, this is the implementation of the multiplication used in the last two lines of your code:
Vec4 operator*( const float& rhs ) const
{
Vec4 ret;
ret.x_ = x_ * rhs;
ret.y_ = y_ * rhs;
ret.z_ = z_ * rhs;
ret.w_ = w_ * rhs;
return ret;
}
This multiplies a vector in 4D space by a scalar, but is completely wrong if you're operating with homogeneous coordinates. Which explains the results you are seeing.
I would suggest that you either write your own vector/matrix library that is suitable for graphics type operations, or use one of the freely available libraries that are tested, and used by others.
BTW, the specific values you are using for your test look somewhat odd. You definitely should not be getting the same results for the two vectors, but it's probably not what you had in mind anyway. For the z coordinate in your input vectors, you are using the distances of the near and far planes in eye coordinates. But then you apply the inverse view-projection matrix to those vectors, which transforms them back from clip/NDC space into world space. So your input vectors for this calculation should be in clip/NDC space, which means the z-coordinate values corresponding to the near/far plane should be at -1 and 1.
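To make that concrete, here is a minimal sketch of the corrected inputs in the question's own notation (only the z values change; whichever math library you end up using, the rest of the calculation stays the same):
//x and y are the normalized screen coords, as in the question
ndk_helper::Vec4 nearPoint = ndk_helper::Vec4(x, y, -1.f, 1.f); //near plane in clip/NDC space
ndk_helper::Vec4 farPoint = ndk_helper::Vec4(x, y, 1.f, 1.f); //far plane in clip/NDC space
//then multiply by the inverse view-projection matrix and divide by w_ as before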

Ray-triangle intersection

I saw that Fast Minimum Storage Ray/Triangle Intersection by Möller and Trumbore is frequently recommended.
The thing is, I don't mind pre-computing and storing any amount of data, as long as it speeds up the intersection.
So my question is: not caring about memory, what are the fastest methods of doing ray-triangle intersection?
Edit: I won't move the triangles, i.e. it is a static scene.
As others have mentioned, the most effective way to speed things up is to use an acceleration structure to reduce the number of ray-triangle intersections needed. That said, you still want your ray-triangle intersections to be fast. If you're happy to precompute stuff, you can try the following:
Convert your ray lines and your triangle edges to Plücker coordinates. This allows you to determine whether your ray line passes through a triangle at 6 multiply/adds per edge. You will still need to compare your ray start and end points with the triangle plane (at 4 multiply/adds per point) to make sure it actually hits the triangle.
Worst-case runtime expense is 26 multiply/adds total. Also, note that you only need to compute the ray/edge sign once per ray/edge combination, so if you're evaluating a mesh, you may be able to use each edge evaluation twice.
Also, these numbers assume everything is being done in homogeneous coordinates. You may be able to reduce the number of multiplications somewhat by normalizing things ahead of time.
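A hedged C sketch of that edge test (all names here are made up for illustration; the edge lines would be precomputed once per triangle, which fits the static-scene requirement):
typedef struct { float x, y, z; } pvec3;
typedef struct { pvec3 d; pvec3 m; } plucker; /* line through a,b: d = b - a, m = a x b */
pvec3 pv_sub(pvec3 a, pvec3 b) { pvec3 r = {a.x-b.x, a.y-b.y, a.z-b.z}; return r; }
pvec3 pv_cross(pvec3 a, pvec3 b) { pvec3 r = {a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x}; return r; }
float pv_dot(pvec3 a, pvec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
plucker plucker_from_points(pvec3 a, pvec3 b) {
plucker l = { pv_sub(b, a), pv_cross(a, b) };
return l;
}
/* Permuted inner product: the 6 multiply/adds per edge mentioned above */
float plucker_side(plucker l1, plucker l2) {
return pv_dot(l1.d, l2.m) + pv_dot(l2.d, l1.m);
}
/* The ray line passes through the triangle iff all three edge products share a sign
(edges taken with consistent winding; e0..e2 precomputed per triangle) */
int line_hits_triangle(plucker ray, plucker e0, plucker e1, plucker e2) {
float s0 = plucker_side(ray, e0);
float s1 = plucker_side(ray, e1);
float s2 = plucker_side(ray, e2);
return (s0 >= 0 && s1 >= 0 && s2 >= 0) || (s0 <= 0 && s1 <= 0 && s2 <= 0);
}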
I have done a lot of benchmarks, and I can confidently say that the fastest (published) method is the one invented by Havel and Herout and presented in their paper Yet Faster Ray-Triangle Intersection (Using SSE4). Even without using SSE it is about twice as fast as Möller and Trumbore's algorithm.
My C implementation of Havel-Herout:
// Assumes vec2/vec3 types with v3_add, v3_sub, v3_cross, v3_dot, v3_scale helpers
// defined elsewhere; the union and ray-interval bounds below are filled in so the
// snippet is self-contained (the ISECT_NEAR/ISECT_FAR values are examples):
typedef union { int i; float f; } int_or_float;
#define ISECT_NEAR 1e-6f
#define ISECT_FAR 1e30f
typedef struct {
vec3 n0; float d0;
vec3 n1; float d1;
vec3 n2; float d2;
} isect_hh_data;
void
isect_hh_pre(vec3 v0, vec3 v1, vec3 v2, isect_hh_data *D) {
vec3 e1 = v3_sub(v1, v0);
vec3 e2 = v3_sub(v2, v0);
D->n0 = v3_cross(e1, e2);
D->d0 = v3_dot(D->n0, v0);
float inv_denom = 1 / v3_dot(D->n0, D->n0);
D->n1 = v3_scale(v3_cross(e2, D->n0), inv_denom);
D->d1 = -v3_dot(D->n1, v0);
D->n2 = v3_scale(v3_cross(D->n0, e1), inv_denom);
D->d2 = -v3_dot(D->n2, v0);
}
inline bool
isect_hh(vec3 o, vec3 d, float *t, vec2 *uv, isect_hh_data *D) {
float det = v3_dot(D->n0, d);
float dett = D->d0 - v3_dot(o, D->n0);
vec3 wr = v3_add(v3_scale(o, det), v3_scale(d, dett));
uv->x = v3_dot(wr, D->n1) + det * D->d1;
uv->y = v3_dot(wr, D->n2) + det * D->d2;
float tmpdet0 = det - uv->x - uv->y;
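// Sign-bit test: accept only if tmpdet0, uv->x and uv->y all share the same sign (no division yet)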
int pdet0 = ((int_or_float)tmpdet0).i;
int pdetu = ((int_or_float)uv->x).i;
int pdetv = ((int_or_float)uv->y).i;
pdet0 = pdet0 ^ pdetu;
pdet0 = pdet0 | (pdetu ^ pdetv);
if (pdet0 & 0x80000000)
return false;
float rdet = 1 / det;
uv->x *= rdet;
uv->y *= rdet;
*t = dett * rdet;
return *t >= ISECT_NEAR && *t <= ISECT_FAR;
}
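Usage is then two steps; a minimal sketch (variable names are illustrative):
isect_hh_data D;
isect_hh_pre(v0, v1, v2, &D); /* once per triangle, e.g. at scene load */
float t;
vec2 uv;
if (isect_hh(ray_origin, ray_dir, &t, &uv, &D)) {
/* hit point is ray_origin + t*ray_dir, with scaled barycentrics in uv */
}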
One suggestion could be to implement an octree (http://en.wikipedia.org/wiki/Octree) to partition your 3D space into blocks. The finer the partitioning, the more memory is required, but the better the accuracy of the tree.
You still need to check ray/triangle intersections, but the idea is that the tree can tell you when you can skip the ray/triangle intersection, because the ray is guaranteed not to hit the triangle.
However, if you start moving your triangle around, you need to update the Octree, and then I'm not sure it's going to save you anything.
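For intuition, the early-out an octree node gives you is just a ray-vs-box test before any triangle tests. A minimal C sketch of that test (slab method; names are made up for illustration):
#include <math.h>
/* o = ray origin, inv_d = componentwise 1/direction, lo/hi = node bounds.
Returns 0 when the ray misses the box, so all triangles in the node can be skipped. */
int ray_hits_box(const float o[3], const float inv_d[3], const float lo[3], const float hi[3]) {
float t0 = 0.0f, t1 = INFINITY;
for (int i = 0; i < 3; i++) {
float ta = (lo[i] - o[i])*inv_d[i];
float tb = (hi[i] - o[i])*inv_d[i];
t0 = fmaxf(t0, fminf(ta, tb));
t1 = fminf(t1, fmaxf(ta, tb));
}
return t0 <= t1;
}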
Found this article by Dan Sunday:
Based on a count of the operations done up to the first rejection test, this algorithm is a bit less efficient than the MT (Möller & Trumbore) algorithm, [...]. However, the MT algorithm uses two cross products whereas our algorithm uses only one, and the one we use computes the normal vector of the triangle's plane, which is needed to compute the line parameter rI. But, when the normal vectors have been precomputed and stored for all triangles in a scene (which is often the case), our algorithm would not have to compute this cross product at all. But, in this case, the MT algorithm would still compute two cross products, and be less efficient than our algorithm.
http://geomalgorithms.com/a06-_intersect-2.html
