Efficiently calculating normal using DirectXMath - directx-11

I need to calculate the normals of some triangles. I have a vector of vertices, where each vertex has x, y, z coordinates; i1, i2, i3 are the indices into the vector of the three vertices of a triangle.
I'm using <DirectXMath.h> and wrote this, which seems to work:
XMFLOAT3 normal;
XMFLOAT3 p1(verts[i1].x, verts[i1].y, verts[i1].z);
XMFLOAT3 p2(verts[i2].x, verts[i2].y, verts[i2].z);
XMFLOAT3 p3(verts[i3].x, verts[i3].y, verts[i3].z);
XMVECTOR v1 = XMLoadFloat3(&p1);
XMVECTOR v2 = XMLoadFloat3(&p2);
XMVECTOR v3 = XMLoadFloat3(&p3);
XMVECTOR n = XMVector3Cross(XMVectorSubtract(v2, v1), XMVectorSubtract(v3, v1));
XMStoreFloat3(&normal, n);
However, it appears to have more loads and stores than actual calculations, and I was wondering if there is a better way to do this. Or are the loads and stores "cheap" operations?
I have to run this for every triangle, and it's taking a large amount of time relative to the rest of my code, so speed improvements would be welcome.

Try adding #define _XM_NO_INTRINSICS_ prior to #include <DirectXMath.h>. This will disable the use of SSE within the library, allowing the compiler more freedom to make its own optimizations.
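If the vertex type stores x, y, z as three contiguous floats, another option is to skip the intermediate XMFLOAT3 copies and load straight from the vertex data. A minimal sketch, assuming that layout:
// Sketch: assumes the vertex struct stores x, y, z as three contiguous floats,
// so the data can be reinterpreted as an XMFLOAT3 and loaded directly.
XMVECTOR v1 = XMLoadFloat3(reinterpret_cast<const XMFLOAT3*>(&verts[i1].x));
XMVECTOR v2 = XMLoadFloat3(reinterpret_cast<const XMFLOAT3*>(&verts[i2].x));
XMVECTOR v3 = XMLoadFloat3(reinterpret_cast<const XMFLOAT3*>(&verts[i3].x));
XMVECTOR n  = XMVector3Cross(XMVectorSubtract(v2, v1), XMVectorSubtract(v3, v1));
XMStoreFloat3(&normal, n);   // add XMVector3Normalize(n) first if a unit normal is needed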

Related

How to improve texture access performance in OpenGL shaders?

Conditions
I use OpenGL 3 and PyOpenGL.
I have ~50 thousand (53,490) vertices, and each of them has 199 vec3 attributes which determine its displacement. It's impossible to store this data as regular vertex attributes, so I use a texture.
The problem is that a non-parallelized C function calculates the vertex displacements as fast as GLSL, and even faster in some cases. I've checked: the issue is the texture reads, and I don't understand how to optimize them.
I've written two different shaders. One calculates the new model in ~0.09 s and the other in ~0.12 s (including attribute assignment, which is the same in both cases).
Code
Both shaders start with
#version 300 es
in vec3 vin_position;
out vec4 vin_pos;
uniform mat4 rotation_matrix;
uniform float coefficients[199];
uniform sampler2D principal_components;
The faster one is:
void main(void) {
    int c_pos = gl_VertexID;
    int texture_size = 8192;
    ivec2 texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
    vec4 tmp = vec4(0.0);
    for (int i = 0; i < 199; i++) {
        tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
        c_pos += 53490;
        texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
    }
    gl_Position = rotation_matrix * vec4(vin_position + tmp.xyz, 246006.0);
    vin_pos = gl_Position;
}
The slower one:
void main(void) {
    int texture_size = 8192;
    int columns = texture_size - texture_size % 199;
    int c_pos = gl_VertexID * 199;
    ivec2 texPos = ivec2(c_pos % columns, c_pos / columns);
    vec4 tmp = vec4(0.0);
    for (int i = 0; i < 199; i++) {
        tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
        texPos.x++;
    }
    gl_Position = rotation_matrix * vec4(vin_position + tmp.xyz, 246006.0);
    vin_pos = gl_Position;
}
The main difference between them:
in the first case attributes of vertices are stored in following way:
first attributes of all vertices
second attributes of all vertices
...
last attributes of all vertices
in the second case attributes of vertices are stored in another way:
all attributes of the first vertex
all attributes of the second vertex
...
all attributes of the last vertex
Also, in the second example the data is aligned so that all attributes of each vertex are stored in a single row. This means that if I know the row and column of the first attribute of some vertex, I only need to increment the x component of the texture coordinate.
I thought that the aligned data would be accessed faster.
Questions
Why is the data not accessed faster?
How can I increase the performance?
Is there a way to link a texture chunk to a vertex?
Are there recommendations for data alignment, or good articles about caching in GPUs (Intel HD, NVIDIA GeForce)?
Notes
The coefficients array changes from frame to frame; otherwise there would be no problem, since I could precalculate the model and be done with it.
Why is the data not accessed faster?
Because GPUs are not magical. GPUs gain performance by performing calculations in parallel. Performing roughly ten million texel fetches (53,490 vertices × 199 fetches each), no matter how it happens, is not going to be fast.
If you were using the results of those textures to do lighting computations, it would appear fast because the cost of the lighting computation would be hidden by the latency of the memory fetches. You are taking the results of a fetch, doing a multiply/add, then doing another fetch. That's slow.
Is there a way to link a texture chunk to a vertex?
Even if there were (and there isn't), how would that help? GPUs execute operations in parallel. That means multiple vertices are being processed simultaneously, each performing ~200 texture fetches.
So what would aid performance there is making each texture access coherent. That is, neighboring vertices would access neighboring texels, thus making the texture fetches more cache efficient. But there's no way to know what vertices will be considered "neighbors". And texture swizzle layouts are implementation dependent, so even if you did know the order of vertex processing, you couldn't adjust your texture to take local advantage of it.
The best way to do that would be to ditch vertex shaders and texture accesses in favor of compute shaders and SSBOs. That way, you have direct knowledge of the locality of your accesses, by setting the work group size. With SSBOs, you can arrange your array in whatever fashion gives you the best locality of access for each wavefront.
But things like this are the equivalent of putting band-aids on a gaping wound.
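For illustration, a minimal host-side sketch of the SSBO route just mentioned. It needs OpenGL 4.3+ or ES 3.1+, i.e. a newer context than the #version 300 es shaders above, and the names here (poseData, poseBuffer, displaceProgram) and the work group size of 64 are assumptions:
// Upload the per-vertex pose data once, bind it as an SSBO, and dispatch a compute
// shader (declared with local_size_x = 64) that writes displaced positions.
GLuint poseBuffer = 0;
glGenBuffers(1, &poseBuffer);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, poseBuffer);
glBufferData(GL_SHADER_STORAGE_BUFFER,
             poseData.size() * sizeof(float),   // 53,490 vertices * 199 poses * 3 floats
             poseData.data(), GL_STATIC_DRAW);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, poseBuffer);

glUseProgram(displaceProgram);
glDispatchCompute((53490 + 63) / 64, 1, 1);      // one invocation per vertex
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);  // make the results visible before drawing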
How can I increase the performance?
Stop doing so many texture fetches.
I'm being completely serious. While there are ways to mitigate the costs of what you're doing, the most effective solution is to change your algorithm so that it doesn't need to do that much work.
Your algorithm looks suspiciously like vertex morphing via a palette of "poses", with the coefficient specifying the weight applied to each pose. If that's the case, then odds are good that most of your coefficients are either 0 or negligibly small. If so, then you're wasting vast amounts of time accessing textures only to transform their contributions into nothing.
If most of your coefficients are 0, then the best thing to do would be to pick some arbitrary and small number for the maximum number of coefficients that can affect the result. For example, 8. You send an array of 8 indices and coefficients to the shader as uniforms. Then you walk that array, fetching only 8 times. And you might be able to get away with just 4.
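As a rough sketch of that idea (C++ here for brevity, though the question uses PyOpenGL; the function name, the K = 8 cutoff, and the host-side coefficients array are assumptions):
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Select the K coefficients with the largest magnitude so the shader can loop K
// times instead of 199. `coefficients` is the host-side copy of the uniform array.
void select_top_coefficients(const float coefficients[199],
                             std::vector<int>& topIndices,
                             std::vector<float>& topCoeffs,
                             int K = 8) {
    std::vector<int> order(199);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + K, order.end(),
                      [&](int a, int b) {
                          return std::fabs(coefficients[a]) > std::fabs(coefficients[b]);
                      });
    topIndices.assign(order.begin(), order.begin() + K);
    topCoeffs.resize(K);
    for (int i = 0; i < K; ++i)
        topCoeffs[i] = coefficients[topIndices[i]];
    // Upload with glUniform1iv / glUniform1fv (or the PyOpenGL equivalents) and loop
    // over K texel fetches in the shader, using the index to compute texPos.
}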

Nearest Neighbors in CUDA Particles

Edit 2: Please take a look at this crosspost for TLDR.
Edit: Given that the particles are segmented into grid cells (say a 16^3 grid), is it a better idea to run one work-group per grid cell, with as many work-items per work-group as the maximum possible number of particles per grid cell?
In that case I could load all particles from neighboring cells into local memory and iterate through them, computing some properties. Then I could write a specific value into each particle in the current grid cell.
Would this approach be beneficial compared to running the kernel for all particles and, for each one, iterating over the (most of the time the same) neighbors?
Also, what is the ideal ratio of number of particles/number of grid cells?
I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:
Buffer P holding all particles' 3D positions (float3)
Buffer Sp storing int2 pairs of particle ids and their spatial hashes. Sp is sorted according to the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)
Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp. Example: L[12] = (int2)(0, 50).
L[12].x is the index (in Sp) of the first particle with spatial hash 12.
L[12].y is the index (in Sp) of the last particle with spatial hash 12.
Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):
__kernel void process_particles(__global float3* P, __global int2* Sp,
                                __global int2* L, __global int* Out) {
    size_t gid = get_global_id(0);
    float3 curr_particle = P[gid];
    int processed_value = 0;
    for (int x = -1; x <= 1; x++)
        for (int y = -1; y <= 1; y++)
            for (int z = -1; z <= 1; z++) {
                float3 neigh_position = curr_particle + (float3)(x, y, z) * GRID_CELL_SIDE;
                // ugly boundary checking
                if ( dot(neigh_position<0, (float3)(1)) +
                     dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
                    continue;
                int neigh_hash = spatial_hash( neigh_position );
                int2 particles_range = L[ neigh_hash ];
                for (int p = particles_range.x; p < particles_range.y; p++)
                    processed_value += heavy_computation( P[ Sp[p].y ] );
            }
    Out[gid] = processed_value;
}
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.
What I want to do is use a Z-order curve as the spatial hash. That way I could have only one for loop iterating through a continuous range of memory when querying neighbors. The only problem is that I don't know what the start and stop Z-index values should be.
The holy grail I want to achieve:
__kernel void process_particles(__global float3* P, __global int2* Sp,
                                __global int2* L, __global int* Out) {
    size_t gid = get_global_id(0);
    float3 curr_particle = P[gid];
    int processed_value = 0;

    // How to accomplish this??
    // `get_neighbors_range()` returns start and end Z-index values
    // representing the range of near-neighbor cells
    int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
    int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
    int last_particle_id  = L[ nearest_neighboring_cells_range.y ].y;

    for (int p = first_particle_id; p <= last_particle_id; p++) {
        processed_value += heavy_computation( P[ Sp[p].y ] );
    }
    Out[gid] = processed_value;
}
You should study the Morton code algorithms closely. Ericson's Real-Time Collision Detection explains this very well.
Ericson - Real-Time Collision Detection
Here is another nice explanation including some tests:
Morton encoding/decoding through bit interleaving: Implementations
The Z-order curve only defines the path along which the coordinates are traversed, so that you can hash 2D or 3D coordinates into a single integer. Although the algorithm can go deeper with every iteration, you have to set the limits yourself. Usually the stop index is denoted by a sentinel, and where the sentinel stops tells you at which level the particle is placed. So the maximum level you define determines the number of cells per dimension. For example, with a maximum level of 6 you have 2^6 = 64, i.e. 64x64x64 cells in your system (3D). That also means you have to use integer-based coordinates; if you use floats you have to convert, e.g. coord.x = 64 * float_x, and so on.
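For reference, a minimal sketch of the 3D Morton encoding by bit interleaving mentioned above (10 bits per axis, so up to a 1024^3 grid; the helper names are placeholders):
#include <cstdint>

// Spread the lower 10 bits of v so that there are two zero bits between each bit.
static uint32_t expand_bits_10(uint32_t v) {
    v &= 0x3FF;
    v = (v | (v << 16)) & 0x030000FF;
    v = (v | (v << 8))  & 0x0300F00F;
    v = (v | (v << 4))  & 0x030C30C3;
    v = (v | (v << 2))  & 0x09249249;
    return v;
}

// Interleave x, y, z (each < 1024) into a 30-bit Morton / Z-order index.
static uint32_t morton3d(uint32_t x, uint32_t y, uint32_t z) {
    return (expand_bits_10(x) << 2) | (expand_bits_10(y) << 1) | expand_bits_10(z);
}

// Example: with a 64x64x64 grid (level 6), convert float coordinates first, e.g.
// uint32_t cx = (uint32_t)(64.0f * pos.x / domain_size);  // similarly cy, cz
// uint32_t hash = morton3d(cx, cy, cz);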
If you know how many cells you have in your system you can define your limits. Are you trying to use a binary octree?
Since particles are in motion (in that CUDA example) you should try to parallelize over the number of particles instead of cells.
If you want to build lists of nearest neighbours, you have to map the particles to cells. This is done through a table that is sorted afterwards from cells to particles. Still, you should iterate through the particles and access their neighbours.
About your code:
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.
Remember Donald Knuth ("premature optimization is the root of all evil"): you should measure where the bottleneck is. You can use the NVIDIA Visual Profiler and look for the bottleneck; I'm not sure what OpenCL has as a profiler.
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
continue;
I don't think you should branch that way; how about having heavy_computation return zero for out-of-bounds positions instead? I'm not sure, but a branch like that may hurt here. Try to remove it somehow.
Running in parallel over the cells is a good idea only if you have no write accesses to the particle data; otherwise you will have to use atomics. If you go over the particle range instead, you have read accesses to the cells and neighbours, but you build your sum in parallel and are not forced into race conditions.
Also, what is the ideal ratio of number of particles/number of grid cells?
It really depends on your algorithm and the particle packing within your domain, but in your case I would set the cell size equal to the particle diameter and just use the number of cells you get.
So if you want to use Z-order and achieve your holy grail, try to use integer coordinates and hash them.
Also try to use larger numbers of particles. You should consider about 65,000 particles, like the CUDA example uses, because then the parallelisation is most efficient: the processing units are kept busy (fewer idle threads).

Applying a "Spread" value to an XMFLOAT4X4

I'm attempting to add a small value to a World Matrix in order to replicate the accuracy of a fired weapon [pistol, assault rifle].
Currently, my World Matrix resides at a Parent Object's position, with the ability to rotate about the Y axis exclusively.
I've done this in Unity3D, running whenever the object needs to be created [once per]:
var coneRotation = Quaternion.Euler(Random.Range(-spread, spread), Random.Range(-spread, spread), 0);
var go = Instantiate(obj, parent.transform.position, transform.rotation * coneRotation) as GameObject;
and am attempting to replicate the results using Direct3D11.
This lambda returns a random value between [-1.5, 1.5] currently:
auto randF = [&](float lower_bound, float upper_bound) -> float
{
    return lower_bound + static_cast<float>(rand()) / (static_cast<float>(RAND_MAX / (upper_bound - lower_bound)));
};
My first thought was to simply multiply a random x and y into the forward vector of the object upon initialization, and move it in this fashion: position = position + forward * speed * dt; [speed being 1800]. However, the rotation is incorrect (not to mention the bullets fire upward).
I've also attempted to make a quaternion [as in Unity3D]: XMVECTOR quaternion = XMVectorSet(random_x, random_y, 0, 0); and to create a rotation matrix using XMMatrixRotationQuaternion.
Afterwards I call XMStoreFloat4x4(&world_matrix, XMLoadFloat4x4(&world_matrix) * rotation);, and restore the position portion of the matrix [accessing world_matrix._41/._42/._43] (world_matrix being the matrix of the "bullet" itself, not the parent).
[I've also tried to reverse the order of the multiplication]
I've read that XMMatrixRotationQuaternion doesn't work with an Euler-style quaternion, while XMQuaternionToAxisAngle does, though I'm not entirely certain how to use it.
What would be the proper way to accomplish something like this?
Many thanks!
Your code XMVECTOR quaternion = XMVectorSet(random_x, random_y, 0, 0); is not creating a valid quaternion. First, if you did not set the w component to 1, then the 4-vector quaternion doesn't actually represent a 3D rotation. Second, a quaternion's vector components are not Euler angles.
You want to use XMQuaternionRotationRollPitchYaw which constructs a quaternion rotation from Euler angle input, or XMQuaternionRotationRollPitchYawFromVector which takes the three Euler angles as a vector. These functions are doing what Unity's Quaternion.Euler method is doing.
Of course, if you want a rotation matrix and not a quaternion, then you can use XMMatrixRotationRollPitchYaw or XMMatrixRotationRollPitchYawFromVector to directly construct a 4x4 rotation matrix from Euler angles (which actually uses quaternions internally anyhow). Based on your code snippet, it looks like you already have a base rotation as a quaternion that you want to concatenate with your spread quaternion, so you probably don't want to use this option for this case.
Note: You should look at using the C++11 standard <random> rather than your home-rolled lambda wrapper around the terrible C rand function.
Something like:
std::random_device rd;
std::mt19937 gen(rd());
// spread should be in radians here (not degrees which is what Unity uses)
std::uniform_real_distribution<float> dis(-spread, spread);
XMVECTOR coneRotation = XMQuaternionRotationRollPitchYaw( dis(gen), dis(gen), 0 );
XMVECTOR rot = XMQuaternionMultiply( parentRot, coneRotation );
XMMATRIX transform = XMMatrixAffineTransformation( g_XMOne, g_XMZero, rot, parentPos );
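If the parent's transform only exists as an XMFLOAT4X4, one way to obtain the parentRot and parentPos used above is XMMatrixDecompose; a small sketch (parent_world is a hypothetical name, world_matrix is the bullet matrix from the question, and transform is from the snippet above):
// Decompose the parent's world matrix into scale / rotation quaternion / translation.
XMMATRIX parentWorld = XMLoadFloat4x4(&parent_world);
XMVECTOR parentScale, parentRot, parentPos;
XMMatrixDecompose(&parentScale, &parentRot, &parentPos, parentWorld);
// Feed parentRot and parentPos into the snippet above, then store the result:
XMStoreFloat4x4(&world_matrix, transform);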
BTW, if you are used to Unity or XNA Game Studio C# math libraries, you might want to check out the SimpleMath wrapper for DirectXMath in DirectX Tool Kit.

Finding translation and scale on two sets of points to get least square error in their distance?

I have two sets of 3D points (original and reconstructed) and correspondence information about the pairs, i.e. which point from one set represents which point in the other. I need to find the 3D translation and scaling factor which transforms the reconstructed set so that the sum of squared distances is smallest (rotation would be nice too, but the points are rotated similarly, so this is not the main priority and might be omitted for the sake of simplicity and speed). So my question is: is this already solved and available somewhere on the Internet? Personally, I would use the least squares method, but I don't have much time (and although I'm somewhat good at math, I don't use it often, so it would be better for me to avoid it), so I would like to use someone else's solution if it exists. I prefer a solution in C++, for example using OpenCV, but the algorithm alone is good enough.
If there is no such solution, I will work it out myself; I don't want to bother you too much.
SOLUTION: (from your answers)
For me it's the Kabsch algorithm.
Base info: http://en.wikipedia.org/wiki/Kabsch_algorithm
General solution: http://nghiaho.com/?page_id=671
STILL NOT SOLVED:
I also need the scale. The scale values from the SVD make no sense to me; when I expect a scale of about 1-4 for all axes (my estimate), the SVD scale is about [2000, 200, 20], which is not helping at all.
Since you are already using the Kabsch algorithm, just have a look at Umeyama's paper, which extends it to get the scale. All you need to do is compute the variance (sigma^2) of your points and calculate the scale as:
(1/sigma^2)*trace(D*S)
where D is the diagonal matrix of singular values from the SVD used in the rotation estimation, and S is either the identity matrix or the diag(1, 1, -1) matrix, depending on the sign of the determinant of U*V (which Kabsch uses to correct reflections into proper rotations). So if you have [2000, 200, 20], multiply the last element by +-1 (depending on the sign of that determinant), sum the values, and divide by sigma^2 of your points to get the scale.
You can recycle the following code, which is using the Eigen library:
typedef Eigen::Matrix<double, 3, 1, Eigen::DontAlign> Vector3d_U; // microsoft's 32-bit compiler can't put Eigen::Vector3d inside a std::vector. for other compilers or for 64-bit, feel free to replace this by Eigen::Vector3d
/**
 * @brief rigidly aligns two sets of poses
 *
 * This calculates such a relative pose <tt>R, t</tt>, such that:
 *
 * @code
 * _TyVector v_pose = R * r_vertices[i] + t;
 * double f_error = (r_tar_vertices[i] - v_pose).squaredNorm();
 * @endcode
 *
 * The sum of squared errors in <tt>f_error</tt> for each <tt>i</tt> is minimized.
 *
 * @param[in] r_vertices is a set of vertices to be aligned
 * @param[in] r_tar_vertices is a set of vertices to align to
 *
 * @return Returns a relative pose that rigidly aligns the two given sets of poses.
 *
 * @note This requires the two sets of poses to have the corresponding vertices stored under the same index.
 */
static std::pair<Eigen::Matrix3d, Eigen::Vector3d> t_Align_Points(
const std::vector<Vector3d_U> &r_vertices, const std::vector<Vector3d_U> &r_tar_vertices)
{
_ASSERTE(r_tar_vertices.size() == r_vertices.size());
const size_t n = r_vertices.size();
Eigen::Vector3d v_center_tar3 = Eigen::Vector3d::Zero(), v_center3 = Eigen::Vector3d::Zero();
for(size_t i = 0; i < n; ++ i) {
v_center_tar3 += r_tar_vertices[i];
v_center3 += r_vertices[i];
}
v_center_tar3 /= double(n);
v_center3 /= double(n);
// calculate centers of positions, potentially extend to 3D
double f_sd2_tar = 0, f_sd2 = 0; // only one of those is really needed
Eigen::Matrix3d t_cov = Eigen::Matrix3d::Zero();
for(size_t i = 0; i < n; ++ i) {
Eigen::Vector3d v_vert_i_tar = r_tar_vertices[i] - v_center_tar3;
Eigen::Vector3d v_vert_i = r_vertices[i] - v_center3;
// get both vertices
f_sd2 += v_vert_i.squaredNorm();
f_sd2_tar += v_vert_i_tar.squaredNorm();
// accumulate squared standard deviation (only one of those is really needed)
t_cov.noalias() += v_vert_i * v_vert_i_tar.transpose();
// accumulate covariance
}
// calculate the covariance matrix
Eigen::JacobiSVD<Eigen::Matrix3d> svd(t_cov, Eigen::ComputeFullU | Eigen::ComputeFullV);
// calculate the SVD
Eigen::Matrix3d R = svd.matrixV() * svd.matrixU().transpose();
// compute the rotation
double f_det = R.determinant();
Eigen::Vector3d e(1, 1, (f_det < 0)? -1 : 1);
// calculate determinant of V*U^T to disambiguate rotation sign
if(f_det < 0)
R.noalias() = svd.matrixV() * e.asDiagonal() * svd.matrixU().transpose();
// recompute the rotation part if the determinant was negative
R = Eigen::Quaterniond(R).normalized().toRotationMatrix();
// renormalize the rotation (not needed but gives slightly more orthogonal transformations)
double f_scale = svd.singularValues().dot(e) / f_sd2_tar;
double f_inv_scale = svd.singularValues().dot(e) / f_sd2; // only one of those is needed
// calculate the scale
R *= f_inv_scale;
// apply scale
Eigen::Vector3d t = v_center_tar3 - (R * v_center3); // R needs to contain scale here, otherwise the translation is wrong
// want to align center with ground truth
return std::make_pair(R, t); // or put it in a single 4x4 matrix if you like
}
For 3D points the problem is known as the Absolute Orientation problem. A C++ implementation is available from Eigen (http://eigen.tuxfamily.org/dox/group__Geometry__Module.html#gab3f5a82a24490b936f8694cf8fef8e60), and the paper is at http://web.stanford.edu/class/cs273/refs/umeyama.pdf.
You can use it via OpenCV by converting the matrices to Eigen with cv::cv2eigen() calls.
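For reference, a minimal sketch of calling that Eigen routine directly (the function name is a placeholder; src and dst are 3xN matrices holding corresponding points in their columns):
#include <Eigen/Geometry>
#include <iostream>

// Sketch: dst ≈ s * R * src + t. Pass true as the third argument to also estimate
// the uniform scale s; Eigen::umeyama returns the 4x4 similarity transform.
void align_with_scale(const Eigen::Matrix3Xd& src, const Eigen::Matrix3Xd& dst) {
    Eigen::Matrix4d T = Eigen::umeyama(src, dst, /*with_scaling=*/true);
    Eigen::Matrix3d sR = T.topLeftCorner<3, 3>();   // rotation multiplied by the scale
    double s = sR.col(0).norm();                    // uniform scale factor
    Eigen::Vector3d t = T.topRightCorner<3, 1>();
    std::cout << "scale: " << s << "\ntranslation:\n" << t << std::endl;
}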
Start by translating both sets of points so that their centroids coincide with the origin of the coordinate system. The translation vector is just the difference between these centroids.
Now we have two sets of coordinates represented as matrices P and Q. One set of points may be obtained from the other by applying some linear operator (which performs both scaling and rotation). This operator is represented by a 3x3 matrix X:
P * X = Q
To find proper scale/rotation we just need to solve this matrix equation, find X, then decompose it into several matrices, each representing some scaling or rotation.
A simple (but probably not numerically stable) way to solve it is to multiply both sides of the equation by the transposed matrix P^T (to get rid of the non-square matrices), then multiply both sides by the inverse of P^T * P:
P^T * P * X = P^T * Q
X = (P^T * P)^-1 * P^T * Q
Applying Singular value decomposition to matrix X gives two rotation matrices and a matrix with scale factors:
X = U * S * V
Here S is a diagonal matrix with scale factors (one scale for each coordinate), and U and V are rotation matrices: one properly rotates the points so that they may be scaled along the coordinate axes, the other rotates them once more to align their orientation with the second set of points.
Example (2D points are used for simplicity):
P = 1 2 Q = 7.5391 4.3455
2 3 12.9796 5.8897
-2 1 -4.5847 5.3159
-1 -6 -15.9340 -15.5511
After solving the equation:
X = 3.3417 -1.2573
2.0987 2.8014
After SVD decomposition:
U = -0.7317 -0.6816
-0.6816 0.7317
S = 4 0
0 3
V = -0.9689 -0.2474
-0.2474 0.9689
Here SVD has properly reconstructed all manipulations I performed on matrix P to get matrix Q: rotate by the angle 0.75, scale X axis by 4, scale Y axis by 3, rotate by the angle -0.25.
If the sets of points are scaled uniformly (the scale factor is equal along each axis), this procedure may be significantly simplified.
Just use the Kabsch algorithm to get the translation/rotation values. Then apply that translation and rotation (the centroids should coincide with the origin of the coordinate system). Then, for each pair of points (and for each coordinate), estimate a linear regression; the regression coefficient is exactly the scale factor, as sketched below.
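A minimal sketch of that regression step, assuming both sets have already been centered and rotation-aligned as described (the container types and function name are placeholders):
#include <array>
#include <vector>

// Least-squares slope of dst ≈ s * src through the origin; s is the scale factor.
double fit_scale(const std::vector<std::array<double, 3>>& src,
                 const std::vector<std::array<double, 3>>& dst) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < src.size(); ++i)
        for (int k = 0; k < 3; ++k) {
            num += src[i][k] * dst[i][k];
            den += src[i][k] * src[i][k];
        }
    return num / den;   // regression coefficient = scale factor
}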
A good explanation: Finding optimal rotation and translation between corresponding 3D points.
The code is in MATLAB, but it's trivial to convert to OpenCV using the cv::SVD class.
You might want to try ICP (Iterative closest point).
Given two sets of 3d points, it will tell you the transformation (rotation + translation) to go from the first set to the second one.
If you're interested in a c++ lightweight implementation, try libicp.
Good luck!
The general transformation, as well as the scale, can be retrieved via Procrustes analysis. It works by superimposing the objects on top of each other and estimating the transformation from that setting. It has been used in the context of ICP many times. In fact, your preference, the Kabsch algorithm, is a special case of this.
Moreover, Horn's alignment algorithm (based on quaternions) also finds a very good solution, while being quite efficient. A Matlab implementation is also available.
Scale can be inferred without SVD if your points are uniformly scaled in all directions (I could not make sense of SVD's scale matrix either). Here is how I solved the same problem:
Measure distances of each point to other points in the point cloud to get a 2d table of distances, where entry at (i,j) is norm(point_i-point_j). Do the same thing for the other point cloud, so you get two tables -- one for original and the other for reconstructed points.
Divide all values in one table by the corresponding values in the other table. Because the points correspond to each other, the distances do too. Ideally, the resulting table has all values equal to each other, and that common value is the scale.
The median value of the divisions should be pretty close to the scale you are looking for. The mean value is also close, but I chose median just to exclude outliers.
Now you can use the scale value to scale all the reconstructed points and then proceed to estimating the rotation.
Tip: If there are too many points in the point clouds to find distances between all of them, then a smaller subset of distances will work, too, as long as it is the same subset for both point clouds. Ideally, just one distance pair would work if there is no measurement noise, e.g when one point cloud is directly derived from the other by just rotating it.
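A minimal sketch of the distance-ratio idea above, using every pair of points (for large clouds, restrict both loops to the same subset as the tip suggests; the names are placeholders):
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

// Ratio of corresponding pairwise distances; the median of the ratios is the
// uniform scale that maps the reconstructed cloud onto the original one.
double estimate_scale(const std::vector<std::array<double, 3>>& original,
                      const std::vector<std::array<double, 3>>& reconstructed) {
    std::vector<double> ratios;
    for (size_t i = 0; i < original.size(); ++i)
        for (size_t j = i + 1; j < original.size(); ++j) {
            double d_orig = 0.0, d_rec = 0.0;
            for (int k = 0; k < 3; ++k) {
                double a = original[i][k] - original[j][k];
                double b = reconstructed[i][k] - reconstructed[j][k];
                d_orig += a * a;
                d_rec  += b * b;
            }
            if (d_rec > 0.0)
                ratios.push_back(std::sqrt(d_orig / d_rec));
        }
    std::nth_element(ratios.begin(), ratios.begin() + ratios.size() / 2, ratios.end());
    return ratios[ratios.size() / 2];   // median ratio (the mean would also work, as noted)
}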
You can also use the ScaleRatio ICP proposed by BaoweiLin.
The code can be found on GitHub.

Ray-triangle intersection

I saw that Fast, Minimum Storage Ray/Triangle Intersection by Möller and Trumbore is frequently recommended.
The thing is, I don't mind pre-computing and storing any amount of data, as long as it speeds up the intersection.
So my question is: not caring about memory, what are the fastest methods of doing ray-triangle intersection?
Edit: I won't move the triangles, i.e. it is a static scene.
As others have mentioned, the most effective way to speed things up is to use an acceleration structure to reduce the number of ray-triangle intersections needed. That said, you still want your ray-triangle intersections to be fast. If you're happy to precompute stuff, you can try the following:
Convert your ray lines and your triangle edges to Plücker coordinates. This allows you to determine if your ray line passes through a triangle at 6 multiply/add's per edge. You will still need to compare your ray start and end points with the triangle plane (at 4 multiply/add's per point) to make sure it actually hits the triangle.
Worst case runtime expense is 26 multiply/add's total. Also, note that you only need to compute the ray/edge sign once per ray/edge combination, so if you're evaluating a mesh, you may be able to use each edge evaluation twice.
Also, these numbers assume everything is being done in homogeneous coordinates. You may be able to reduce the number of multiplications some by normalizing things ahead of time.
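A hedged sketch of the Plücker sidedness test described above (the vector type and helpers are minimal stand-ins, and edge-orientation conventions differ between write-ups, so treat the sign handling as an assumption to validate):
struct Vec3 { float x, y, z; };

static Vec3  v_sub(Vec3 a, Vec3 b)   { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static Vec3  v_cross(Vec3 a, Vec3 b) { return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x }; }
static float v_dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Plücker coordinates of the directed line through a towards b:
// direction U = b - a, moment V = a x b.
struct Plucker { Vec3 U, V; };

static Plucker plucker_from_points(Vec3 a, Vec3 b) { return { v_sub(b, a), v_cross(a, b) }; }

// Permuted inner product: its sign tells on which side line l1 passes line l2
// (this is the "6 multiply/add's per edge" mentioned above).
static float side(const Plucker& l1, const Plucker& l2) {
    return v_dot(l1.U, l2.V) + v_dot(l2.U, l1.V);
}

// True if the ray's infinite line passes through triangle v0 v1 v2. The edge
// Plückers would normally be precomputed and stored per triangle, and the ray
// start/end points still need the triangle-plane comparison described above.
static bool line_hits_triangle(Vec3 org, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2) {
    Plucker ray = { dir, v_cross(org, dir) };   // line through org with direction dir
    float s0 = side(ray, plucker_from_points(v0, v1));
    float s1 = side(ray, plucker_from_points(v1, v2));
    float s2 = side(ray, plucker_from_points(v2, v0));
    // All three sidedness values share a sign when the line passes inside the
    // triangle; zero means it touches an edge.
    return (s0 >= 0 && s1 >= 0 && s2 >= 0) || (s0 <= 0 && s1 <= 0 && s2 <= 0);
}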
I have done a lot of benchmarks, and I can confidently say that the fastest (published) method is the one invented by Havel and Herout and presented in their paper Yet Faster Ray-Triangle Intersection (Using SSE4). Even without using SSE it is about twice as fast as Möller and Trumbore's algorithm.
My C implementation of Havel-Herout:
/* int_or_float is not shown in the original answer; this is an assumed definition
   used to inspect the sign bits of floats without extra comparisons. */
typedef union { int i; float f; } int_or_float;

typedef struct {
    vec3 n0; float d0;   /* triangle plane */
    vec3 n1; float d1;   /* first barycentric plane */
    vec3 n2; float d2;   /* second barycentric plane */
} isect_hh_data;

/* Precompute the three planes for one triangle (done once, stored per triangle). */
void
isect_hh_pre(vec3 v0, vec3 v1, vec3 v2, isect_hh_data *D) {
    vec3 e1 = v3_sub(v1, v0);
    vec3 e2 = v3_sub(v2, v0);
    D->n0 = v3_cross(e1, e2);
    D->d0 = v3_dot(D->n0, v0);
    float inv_denom = 1 / v3_dot(D->n0, D->n0);
    D->n1 = v3_scale(v3_cross(e2, D->n0), inv_denom);
    D->d1 = -v3_dot(D->n1, v0);
    D->n2 = v3_scale(v3_cross(D->n0, e1), inv_denom);
    D->d2 = -v3_dot(D->n2, v0);
}

/* Ray/triangle test against the precomputed data; rejects via the sign bits of
   the scaled barycentric coordinates, then computes t, u and v. */
inline bool
isect_hh(vec3 o, vec3 d, float *t, vec2 *uv, isect_hh_data *D) {
    float det = v3_dot(D->n0, d);
    float dett = D->d0 - v3_dot(o, D->n0);
    vec3 wr = v3_add(v3_scale(o, det), v3_scale(d, dett));
    uv->x = v3_dot(wr, D->n1) + det * D->d1;
    uv->y = v3_dot(wr, D->n2) + det * D->d2;
    float tmpdet0 = det - uv->x - uv->y;
    int pdet0 = ((int_or_float)tmpdet0).i;
    int pdetu = ((int_or_float)uv->x).i;
    int pdetv = ((int_or_float)uv->y).i;
    pdet0 = pdet0 ^ pdetu;
    pdet0 = pdet0 | (pdetu ^ pdetv);
    if (pdet0 & 0x80000000)
        return false;
    float rdet = 1 / det;
    uv->x *= rdet;
    uv->y *= rdet;
    *t = dett * rdet;
    return *t >= ISECT_NEAR && *t <= ISECT_FAR;
}
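For context, a hypothetical usage sketch of the two routines, assuming the same vec3/vec2 helpers and ISECT_NEAR/ISECT_FAR constants used above are defined elsewhere:
bool hit_example(vec3 v0, vec3 v1, vec3 v2, vec3 ray_origin, vec3 ray_dir) {
    isect_hh_data tri;
    isect_hh_pre(v0, v1, v2, &tri);   /* once per triangle; 12 floats of storage */
    float t;
    vec2 uv;
    /* hit point is ray_origin + t * ray_dir; uv holds the barycentric coordinates */
    return isect_hh(ray_origin, ray_dir, &t, &uv, &tri);
}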
One suggestion could be to implement the octree (http://en.wikipedia.org/wiki/Octree) algorithm to partition your 3D space into very fine blocks. The finer the partitioning, the more memory is required, but the better the accuracy of the tree.
You still need to check ray/triangle intersections, but the idea is that the tree can tell you when you can skip the ray/triangle intersection, because the ray is guaranteed not to hit the triangle.
However, if you start moving your triangles around, you need to update the octree, and then I'm not sure it's going to save you anything.
Found this article by Dan Sunday:
Based on a count of the operations done up to the first rejection test, this algorithm is a bit less efficient than the MT (Möller & Trumbore) algorithm, [...]. However, the MT algorithm uses two cross products whereas our algorithm uses only one, and the one we use computes the normal vector of the triangle's plane, which is needed to compute the line parameter rI. But, when the normal vectors have been precomputed and stored for all triangles in a scene (which is often the case), our algorithm would not have to compute this cross product at all. But, in this case, the MT algorithm would still compute two cross products, and be less efficient than our algorithm.
http://geomalgorithms.com/a06-_intersect-2.html
