Rendering to a full 3D Render Target in one pass - directx-11

Using DirectX 11, I created a 3D volume texture that can be bound as a render target:
D3D11_TEXTURE3D_DESC texDesc3d;
// ...
texDesc3d.Usage = D3D11_USAGE_DEFAULT;
texDesc3d.BindFlags = D3D11_BIND_RENDER_TARGET;
// Create volume texture and views
m_dxDevice->CreateTexture3D(&texDesc3d, nullptr, &m_tex3d);
m_dxDevice->CreateRenderTargetView(m_tex3d, nullptr, &m_tex3dRTView);
I would now like to update the whole render target in one pass, filling it with procedural data generated in a pixel shader, similar to updating a 2D render target with a 'fullscreen pass'. All I need to generate the data are the UVW coordinates of the voxel in question.
For 2D, a simple vertex shader that renders a full screen triangle can be built:
struct VS_OUTPUT
{
    float4 position : SV_Position;
    float2 uv : TexCoord;
};

// input: three empty vertices
VS_OUTPUT main( uint vertexID : SV_VertexID )
{
    VS_OUTPUT result;
    result.uv = float2((vertexID << 1) & 2, vertexID & 2);
    result.position = float4(result.uv * float2(2.0f, -2.0f) + float2(-1.0f, 1.0f), 0.0f, 1.0f);
    return result;
}
I have a hard time wrapping my head around how to adapt this principle to 3D. Is this even possible in DirectX 11, or do I have to render to individual slices of the volume texture as described here?

Here is some sample code doing it with the graphics pipeline. You basically batch N instances of a fullscreen triangle and route each instance to a volume slice in the geometry shader.
struct VS_OUTPUT
{
    float4 position : SV_Position;
    float2 uv : TexCoord;
    uint index : SLICEINDEX;
};

VS_OUTPUT main( uint vertexID : SV_VertexID, uint ii : SV_InstanceID )
{
    VS_OUTPUT result;
    result.uv = float2((vertexID << 1) & 2, vertexID & 2);
    result.position = float4(result.uv * float2(2.0f, -2.0f) + float2(-1.0f, 1.0f), 0.0f, 1.0f);
    result.index = ii;
    return result;
}
Now you need to call DrawInstanced with 3 vertices and N instances, where N is your volume's slice count.
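A minimal sketch of that draw call (assuming context is your immediate context and sliceCount is the volume depth):

// No vertex or index buffer is needed; the vertex shader derives everything from SV_VertexID.
context->IASetInputLayout(nullptr);
context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
context->OMSetRenderTargets(1, &m_tex3dRTView, nullptr); // the volume RTV created earlier
context->DrawInstanced(3, sliceCount, 0, 0); // 3 vertices per instance, one instance per slice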
Then you route each triangle to its slice in the geometry shader like this:
struct psInput
{
    float4 pos : SV_POSITION;
    float2 uv : TEXCOORD0;
    uint index : SV_RenderTargetArrayIndex; // This writes the triangle to a specific slice; you can read it in the pixel shader too
};

[maxvertexcount(3)]
void GS( triangle VS_OUTPUT input[3], inout TriangleStream<psInput> gsout )
{
    psInput output;
    for (uint i = 0; i < 3; i++)
    {
        output.pos = input[i].pos;
        output.uv = input[i].uv;
        output.index = input[0].index; // Use vertex 0, as the full triangle must go to the same slice
        gsout.Append(output);
    }
    gsout.RestartStrip();
}
Now you have access to the slice index in your pixel shader:
cbuffer VolumeParams : register(b0) { float sliceCount; }; // assumed constant buffer holding the slice count

float4 PS(psInput input) : SV_Target
{
    // Do something with the uvs, and use the slice index as Z
    float3 uvw = float3(input.uv, (input.index + 0.5f) / sliceCount);
    return float4(uvw, 1.0f); // generate your procedural data from uvw here
}
Compute shader version (don't forget to create a UAV for your volume); the numthreads values here are totally arbitrary:
RWTexture3D<float4> volumeUAV : register(u0); // UAV over the volume texture

[numthreads(8, 8, 8)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    // Standard overflow safeguard: skip threads outside the volume
    uint3 dims;
    volumeUAV.GetDimensions(dims.x, dims.y, dims.z);
    if (any(tid >= dims))
        return;
    // Generate data using tid coordinates, e.g. the normalized voxel position:
    volumeUAV[tid] = float4((tid + 0.5f) / (float3)dims, 1.0f);
}
Now you instead need to call Dispatch with width/8, height/8, depth/8 thread groups, rounding up when the volume size is not a multiple of 8.
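For example (a sketch; volumeUAV, width, height and depth are assumed from your setup):

// Round up so partial thread groups still cover volumes that are not multiples of 8
const UINT groupsX = (width  + 7) / 8;
const UINT groupsY = (height + 7) / 8;
const UINT groupsZ = (depth  + 7) / 8;
context->CSSetUnorderedAccessViews(0, 1, &volumeUAV, nullptr);
context->Dispatch(groupsX, groupsY, groupsZ);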

Related

OpenGL uvw projection, what's under the hood?

If linear interpolation happens during the rasterization stage in the OpenGL pipeline, and the vertices have already been transformed to screen-space, where does the depth information used for perspective-correct interpolation come from?
Can anybody give a detailed description of how OpenGL goes from screen-space primitives to fragments with correctly interpolated values?
The output of a vertex shader is a four-component vector, vec4 gl_Position. From Section 13.6 Coordinate Transformations of the core GL 4.4 spec:
Clip coordinates for a vertex result from shader execution, which yields a vertex coordinate gl_Position.
Perspective division on clip coordinates yields normalized device coordinates, followed by a viewport transformation (see section 13.6.1) to convert these coordinates into window coordinates.
OpenGL does the perspective divide as
device.xyz = gl_Position.xyz / gl_Position.w
But then keeps the 1 / gl_Position.w as the last component of gl_FragCoord:
gl_FragCoord.xyz = device.xyz scaled to viewport
gl_FragCoord.w = 1 / gl_Position.w
This transform is bijective, so no depth information is lost. In fact as we see below, the 1 / gl_Position.w is crucial for perspective correct interpolation.
Short introduction to barycentric coordinates
Given a triangle (P0, P1, P2) one can parametrize all the points inside the triangle by the linear combinations of the vertices:
P(b0,b1,b2) = P0*b0 + P1*b1 + P2*b2
where b0 + b1 + b2 = 1 and b0 ≥ 0, b1 ≥ 0, b2 ≥ 0.
Given a point P inside the triangle, the coefficients (b0, b1, b2) that satisfy the equation above are called the barycentric coordinates of that point. For non-degenerate triangles they are unique, and can be calculated as quotients of the areas of the following triangles:
b0(P) = area(P, P1, P2) / area(P0, P1, P2)
b1(P) = area(P0, P, P2) / area(P0, P1, P2)
b2(P) = area(P0, P1, P) / area(P0, P1, P2)
Each bi can be thought of as 'how much of Pi has to be mixed in'. So b = (1,0,0), (0,1,0) and (0,0,1) are the vertices of the triangle, (1/3, 1/3, 1/3) is the barycenter, and so on.
Given an attribute (f0, f1, f2) on the vertices of the triangle, we can now interpolate it over the interior:
f(P) = f0*b0(P) + f1*b1(P) + f2*b2(P)
This is a linear function of P, therefore it is the unique linear interpolant over the given triangle. The math also works in either 2D or 3D.
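For illustration, here is a minimal C++ sketch of the area-quotient computation above (ad hoc types, not from any library):

#include <array>

struct vec2f { float x, y; };

// 2D cross product of the edge vectors; proportional to the signed triangle area
static float signed_area(vec2f p0, vec2f p1, vec2f p2)
{
    return (p1.x - p0.x) * (p2.y - p0.y) - (p1.y - p0.y) * (p2.x - p0.x);
}

// Barycentric coordinates of p in triangle (p0, p1, p2); assumes a non-degenerate triangle
std::array<float, 3> barycentric(vec2f p, vec2f p0, vec2f p1, vec2f p2)
{
    const float denom = signed_area(p0, p1, p2);
    return { signed_area(p,  p1, p2) / denom,
             signed_area(p0, p,  p2) / denom,
             signed_area(p0, p1, p ) / denom };
}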
Perspective correct interpolation
Let's say we fill a projected 2D triangle on the screen. For every fragment we have its window coordinates. First we calculate its barycentric coordinates by inverting the P(b0,b1,b2) function, which is a linear function in window coordinates. This gives us the barycentric coordinates of the fragment on the 2D triangle projection.
Perspective correct interpolation of an attribute would vary linearly in the clip coordinates (and by extension, world coordinates). For that we need to get the barycentric coordinates of the fragment in clip space.
As it happens (see [1] and [2]), the depth of the fragment is not linear in window coordinates, but the depth inverse (1/gl_Position.w) is. Accordingly the attributes and the clip-space barycentric coordinates, when weighted by the depth inverse, vary linearly in window coordinates.
Therefore, we compute the perspective-corrected barycentric coordinates as:

        ( b0 / gl_Position[0].w,  b1 / gl_Position[1].w,  b2 / gl_Position[2].w )
B  =  -----------------------------------------------------------------------------
        b0 / gl_Position[0].w  +  b1 / gl_Position[1].w  +  b2 / gl_Position[2].w
and then use it to interpolate the attributes from the vertices.
Note: GL_NV_fragment_shader_barycentric exposes the device-linear barycentric coordinates through gl_BaryCoordNoPerspNV and the perspective corrected through gl_BaryCoordNV.
Implementation
Here is C++ code that rasterizes and shades a triangle on the CPU, in a manner similar to OpenGL. I encourage you to compare it with the shaders listed below:
struct Renderbuffer { int w, h, ys; void *data; };
struct Vert { vec4 position, texcoord, color; };
struct Varying { vec4 texcoord, color; };
void vertex_shader(const Vert &in, vec4 &gl_Position, Varying &OUT)
{
    OUT.texcoord = in.texcoord;
    OUT.color = in.color;
    gl_Position = vec4(in.position.x, in.position.y, -2*in.position.z - 2*in.position.w, -in.position.z);
}

void fragment_shader(vec4 &gl_FragCoord, const Varying &IN, vec4 &OUT)
{
    OUT = IN.color;
    vec2 wrapped = IN.texcoord.xy - floor(IN.texcoord.xy);
    bool brighter = (wrapped[0] < 0.5) != (wrapped[1] < 0.5);
    if(!brighter)
        OUT.rgb *= 0.5f;
}
// render output unit / render operations pipeline
void rop(Renderbuffer &buf, int x, int y, const vec4 &c)
{
    uint8_t *p = (uint8_t*)buf.data + buf.ys*(buf.h - y - 1) + 4*x;
    p[0] = linear_to_srgb8(c[0]);
    p[1] = linear_to_srgb8(c[1]);
    p[2] = linear_to_srgb8(c[2]);
    p[3] = lround(c[3]*255);
}
void draw_triangle(Renderbuffer &color_attachment, const box2 &viewport, const Vert *verts)
{
    auto area = [](const vec2 &p0, const vec2 &p1, const vec2 &p2) { return cross(p1 - p0, p2 - p0); };
    auto interpolate = [](const auto a[3], auto p, const vec3 &coord) { return coord.x*a[0].*p + coord.y*a[1].*p + coord.z*a[2].*p; };

    Varying perVertex[3];
    vec4 gl_Position[3];

    box2 aabb = { viewport.hi, viewport.lo };
    for(int i = 0; i < 3; ++i)
    {
        vertex_shader(verts[i], gl_Position[i], perVertex[i]);

        // convert to normalized device coordinates
        gl_Position[i].w = 1/gl_Position[i].w;
        gl_Position[i].xyz *= gl_Position[i].w;

        // convert to window coordinates
        gl_Position[i].xy = mix(viewport.lo, viewport.hi, 0.5f*(gl_Position[i].xy + 1.0f));

        aabb = join(aabb, gl_Position[i].xy);
    }

    const float denom = 1/area(gl_Position[0].xy, gl_Position[1].xy, gl_Position[2].xy);

    // loop over all pixels in the rectangle bounding the triangle
    const ibox2 iaabb = lround(aabb);
    for(int y = iaabb.lo.y; y < iaabb.hi.y; ++y)
    for(int x = iaabb.lo.x; x < iaabb.hi.x; ++x)
    {
        vec4 gl_FragCoord;
        gl_FragCoord.xy = vec2(x, y) + 0.5f;

        // fragment barycentric coordinates in window coordinates
        const vec3 barycentric = denom*vec3(
            area(gl_FragCoord.xy, gl_Position[1].xy, gl_Position[2].xy),
            area(gl_Position[0].xy, gl_FragCoord.xy, gl_Position[2].xy),
            area(gl_Position[0].xy, gl_Position[1].xy, gl_FragCoord.xy)
        );

        // discard fragment outside the triangle. this doesn't handle edges correctly.
        if(barycentric.x < 0 || barycentric.y < 0 || barycentric.z < 0)
            continue;

        // interpolate inverse depth linearly
        gl_FragCoord.z = interpolate(gl_Position, &vec4::z, barycentric);
        gl_FragCoord.w = interpolate(gl_Position, &vec4::w, barycentric);

        // clip fragments to the near/far planes (as if by GL_ZERO_TO_ONE)
        if(gl_FragCoord.z < 0 || gl_FragCoord.z > 1)
            continue;

        // convert to perspective correct (clip-space) barycentric
        const vec3 perspective = 1/gl_FragCoord.w*barycentric*vec3(gl_Position[0].w, gl_Position[1].w, gl_Position[2].w);

        // interpolate attributes
        Varying varying = {
            interpolate(perVertex, &Varying::texcoord, perspective),
            interpolate(perVertex, &Varying::color, perspective),
        };

        vec4 color;
        fragment_shader(gl_FragCoord, varying, color);
        rop(color_attachment, x, y, color);
    }
}
int main(int argc, char *argv[])
{
    Renderbuffer buffer = { 512, 512, 512*4 };
    buffer.data = calloc(buffer.ys, buffer.h);

    // VAO interleaved attributes buffer
    Vert verts[] = {
        { { -1, -1, -2, 1 }, { 0, 0, 0, 1 }, { 0, 0, 1, 1 } },
        { {  1, -1, -1, 1 }, { 10, 0, 0, 1 }, { 1, 0, 0, 1 } },
        { {  0,  1, -1, 1 }, { 0, 10, 0, 1 }, { 0, 1, 0, 1 } },
    };

    box2 viewport = { 0, 0, buffer.w, buffer.h };
    draw_triangle(buffer, viewport, verts);

    stbi_write_png("out.png", buffer.w, buffer.h, 4, buffer.data, buffer.ys);
}
OpenGL shaders
Here are the OpenGL shaders used to generate the reference image.
Vertex shader:
#version 450 core
layout(location = 0) in vec4 position;
layout(location = 1) in vec4 texcoord;
layout(location = 2) in vec4 color;
out gl_PerVertex { vec4 gl_Position; };
layout(location = 0) out Varying { vec4 texcoord; vec4 color; } OUT;
void main() {
    OUT.texcoord = texcoord;
    OUT.color = color;
    gl_Position = vec4(position.x, position.y, -2*position.z - 2*position.w, -position.z);
}
Fragment shader:
#version 450 core
layout(location = 0) in Varying { vec4 texcoord; vec4 color; } IN;
layout(location = 0) out vec4 OUT;
void main() {
    OUT = IN.color;
    vec2 wrapped = fract(IN.texcoord.xy);
    bool brighter = (wrapped.x < 0.5) != (wrapped.y < 0.5);
    if(!brighter)
        OUT.rgb *= 0.5;
}
Results
Here are the almost identical images generated by the C++ (left) and OpenGL (right) code:
The differences are caused by different precision and rounding modes.
For comparison, here is one that is not perspective correct (uses barycentric instead of perspective for the interpolation in the code above):
The formula that you will find in the GL specification (look on page 427; the link is to the current 4.4 spec, but it has always been that way) for perspective-corrected interpolation of an attribute value in a triangle is:
      a * f_a / w_a  +  b * f_b / w_b  +  c * f_c / w_c
f  =  --------------------------------------------------
             a / w_a  +  b / w_b  +  c / w_c
where a, b, c denote the barycentric coordinates of the point in the triangle we are interpolating for (a, b, c >= 0, a + b + c = 1), f_i the attribute value at vertex i, and w_i the clip-space w coordinate of vertex i. Note that the barycentric coordinates are calculated only for the 2D projection of the window-space coordinates of the triangle (so z is ignored).
This is what the formulas that ybungalowbill gave in his fine answer boil down to, in the general case, with an arbitrary projection axis. Actually, the last row of the projection matrix defines just the projection axis the image plane will be orthogonal to, and the clip-space w component is just the dot product between the vertex coordinates and that axis.
In the typical case, the projection matrix has (0,0,-1,0) as the last row, so it transforms such that w_clip = -z_eye, and this is what ybungalowbill used. However, since w is what we actually divide by (that is the only nonlinear step in the whole transformation chain), this will work for any projection axis. It will also work in the trivial case of orthogonal projections, where w is always 1 (or at least constant).
Note a few things for an efficient implementation of this. The inversion 1/w_i can be pre-calculated per vertex (let's call them q_i in the following); it does not have to be re-evaluated per fragment. And it is totally free, since we divide by w anyway when going into NDC space, so we can save that value. The GL spec never describes how a certain feature is to be implemented internally, but the fact that the screen-space coordinates are accessible in gl_FragCoord.xyz, and that gl_FragCoord.w is guaranteed to give the (linearly interpolated) 1/w clip-space coordinate, is quite revealing here. That per-fragment 1/w value is actually the denominator of the formula given above.
The factors a/w_a, b/w_b and c/w_c are each used twice in the formula. And they are constant for any attribute value, no matter how many attributes are to be interpolated. So, per fragment, you can calculate a' = q_a * a, b' = q_b * b and c' = q_c * c, and get
      a' * f_a  +  b' * f_b  +  c' * f_c
f  =  ----------------------------------
               a'  +  b'  +  c'
So the perspective interpolation boils down to
3 additional multiplications,
2 additional additions, and
1 additional division
per fragment.
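As a minimal C++ sketch of that per-fragment work (names are illustrative):

// q0, q1, q2 = 1/w per vertex, precomputed (free, since we divide by w for NDC anyway).
// a, b, c are the window-space barycentric coordinates of the fragment.
float interpolate_perspective(float a, float b, float c,
                              float q0, float q1, float q2,
                              float f0, float f1, float f2)
{
    const float aq = a * q0, bq = b * q1, cq = c * q2; // 3 additional multiplications
    const float denom = aq + bq + cq;                  // 2 additional additions
    return (aq * f0 + bq * f1 + cq * f2) / denom;      // 1 additional division
}

Note that aq, bq, cq and denom are shared by all attributes; only the weighted sum in the last line is repeated per attribute.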

Sprite Kit Shader Uniforms Ignored

The shader that I'm using relies upon the position of the tiles in my game. I haven't found anything on using attribute variables with SKShader objects, so I went with updating the uniform variables instead. But it seems that the shader doesn't pick up the variables, especially once their values have been updated. I am trying to make a basic lighting effect, but I can't get anything out of the shader at all. Any help? My code for the shader and for the Objective-C classes is below.
Shader
uniform float midX, midY;
uniform float posX;
uniform float posY;

void main()
{
    vec4 temp = SKDefaultShading(); // get the default shading
    float lightRad = 200.0; // light radius
    float dist = distance(vec2(posX, posY), vec2(midX, midY)); // distance from the light to the tile
    vec4 color = vec4(1.0, 0.0, 0.0, dist / lightRad); // creates an alpha gradient for the light (falloff)
    if (dist < lightRad) // only apply the light color if the distance from the light to the tile is smaller than the light radius
    {
        gl_FragColor = temp * color; // apply the color
    }
    else // otherwise, do nothing
    {
        gl_FragColor = temp;
    }
}
Code
- (void)loadShaders
{
    SKUniform *posX = [SKUniform uniformWithName:@"posX" float:0.0f]; // adds the x position (with a placeholder value)
    SKUniform *posY = [SKUniform uniformWithName:@"posY" float:0.0f]; // adds the y position (with a placeholder value)
    [_shader addUniform:posX];
    [_shader addUniform:posY];
}

- (void)update:(CFTimeInterval)currentTime
{
    for (int i = 0; i < _array.count; i++) // loop through all tiles
    {
        float x = ((i % 100) - 13.5f) * 15.0f; // calculate x pos of the tile
        float y = ((1 - (i / 100)) + 6.5f) * 15.0f; // calculate y pos of the tile
        SKUniform *uniX = [[_tMap getShader] uniformNamed:@"posX"]; // get the uniform named posX
        uniX.floatValue = x; // set the value of that uniform
        SKUniform *uniY = [[_tMap getShader] uniformNamed:@"posY"]; // get the uniform named posY
        uniY.floatValue = y; // set the value of that uniform
    }
}
I'm fairly new to Sprite Kit, and I'm also new to GLSL.

Why is this basic "rotate around the origin" failing to work?

I've done this a hundred times, but this is my first time with a manually constructed cube made of "sticks", which are 3D lines. It's constructed around the origin, out 5 from the origin in each of the X, Y, and Z directions.
When I rotate it, I'm still "inside it" and it rotates around me (the camera). I'm applying a translation and rotation, so I'm stymied as to what I'm doing wrong.
Here's the basic code to rotate the box, by which I mean generate its world matrix:
float rotateX = 0.0f, rotateY = 0.0f, rotateZ = 0.0f;
XMFLOAT4 positionBox = XMFLOAT4(0, 0, -50, 1); // Camera at origin looking at this
XMMATRIX matrixCubeWorld;

void CALLBACK OnFrameMove( double fTime, float fElapsedTime, void* pUserContext )
{
    auto pCamera = g_GameServices.GetService<CWorldCamera>();

    XMMATRIX translation = XMMatrixTranslationFromVector(XMLoadFloat4(&positionBox));
    XMMATRIX rotation = XMMatrixRotationRollPitchYaw(rotateX, rotateY, rotateZ);
    matrixCubeWorld = rotation * translation;

    if (GetKeyState('X') < 0)
        rotateX = RotateAround(rotateX, fElapsedTime);
    if (GetKeyState('Y') < 0)
        rotateY = RotateAround(rotateY, fElapsedTime);
}
And when I set up to draw, I use that matrix:
D3D11_MAPPED_SUBRESOURCE MappedResource;
V(pd3dImmediateContext->Map(_pVertexShaderVariables, 0, D3D11_MAP_WRITE_DISCARD, 0, &MappedResource));
auto pCB = reinterpret_cast<VSCB3DLineChangesEveryFrame *>(MappedResource.pData);
pCB->_gWorldViewProj = matrixCubeWorld * pCamera->GetViewMatrix() * pCamera->GetProjMatrix();
pd3dImmediateContext->Unmap(_pVertexShaderVariables, 0);
return hr;
...and the shader is as simple as can be:
VertexShaderOutput Line3DVertexShaderFunction(float3 position : POSITION, float4 color : COLOR, float2 tex : TEXCOORD0)
{
    VertexShaderOutput output;
    output.position = mul(float4(position, 1), _gWorldViewProj);
    output.color = color;
    output.tex = tex;
    return output;
}
So do I have a bug or a misunderstanding? I've tried with the inverse of the translation, thinking that would 'bring it back to the origin before rotating' but didn't improve it.
Transformations look good, imho. Maybe it's due to the fact that XMMatrixTranslationFromVector takes only a 3D vector, as the documentation (MSDN) says; the w component of the vector is ignored.
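A quick illustration (a sketch using DirectXMath; only x, y and z are read):

// w = 123 has no effect: this yields the same matrix as XMMatrixTranslation(0, 0, -50)
XMVECTOR v = XMVectorSet(0.0f, 0.0f, -50.0f, 123.0f);
XMMATRIX t = XMMatrixTranslationFromVector(v);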
Also make sure that the RotateAround function and the camera's view/projection matrices give correct results.
Best regards.

DX11 convert pixel format BGRA to RGBA

I currently have the problem that a library creates a DX11 texture with a BGRA pixel format, but the displaying library can only display RGBA correctly (this means the colors are swapped in the rendered image).
After looking around I found a simple for loop to solve the problem, but the performance is not very good and scales badly with higher resolutions. I'm new to DirectX, and maybe I just missed a simple function to do the conversion.
// Get the image data
unsigned char* pDest = view->image->getPixels();

// Prepare source texture
ID3D11Texture2D* pTexture = static_cast<ID3D11Texture2D*>( tex );

// Get context
ID3D11DeviceContext* pContext = NULL;
dxDevice11->GetImmediateContext(&pContext);

// Copy data to the staging texture, fast operation
pContext->CopySubresourceRegion(texStaging, 0, 0, 0, 0, tex, 0, nullptr);

// Create mapping
D3D11_MAPPED_SUBRESOURCE mapped;
HRESULT hr = pContext->Map( texStaging, 0, D3D11_MAP_READ, 0, &mapped );
if ( FAILED( hr ) )
{
    return;
}

// Calculate size
const size_t size = _width * _height * 4;

// Access pixel data
unsigned char* pSrc = static_cast<unsigned char*>( mapped.pData );

// Offsets
int offsetSrc = 0;
int offsetDst = 0;
int rowOffset = mapped.RowPitch - _width * 4; // row padding in bytes

// Loop through it, BGRA to RGBA conversion
for (int row = 0; row < _height; ++row)
{
    for (int col = 0; col < _width; ++col)
    {
        pDest[offsetDst]   = pSrc[offsetSrc+2];
        pDest[offsetDst+1] = pSrc[offsetSrc+1];
        pDest[offsetDst+2] = pSrc[offsetSrc];
        pDest[offsetDst+3] = pSrc[offsetSrc+3];
        offsetSrc += 4;
        offsetDst += 4;
    }
    // Adjust offset to skip the row padding
    offsetSrc += rowOffset;
}

// Unmap texture
pContext->Unmap( texStaging, 0 );
Solution:
Texture2D txDiffuse : register(t0);
SamplerState texSampler : register(s0);

struct VSScreenQuadOutput
{
    float4 Position : SV_POSITION;
    float2 TexCoords0 : TEXCOORD0;
};

float4 PSMain(VSScreenQuadOutput input) : SV_Target
{
    // The swizzle swaps the red and blue channels while sampling
    return txDiffuse.Sample(texSampler, input.TexCoords0).bgra;
}
Obviously, iterating over a texture on your CPU is not the most effective way. If you know that the colors in a texture are always swapped like that, and you don't want to modify the texture itself in your C++ code, the most straightforward way is to do it in the pixel shader: when you sample the texture, simply swap the channels there. You won't even notice any performance drop.

How to convert world coordinates to screen coordinates in OpenGL ES 2.0

I am using the following OpenGL ES 1.x code to set up my projection coordinates.
glMatrixMode(GL_PROJECTION);
float width = 320;
float height = 480;
glOrthof(0.0,            // Left
         1.0,            // Right
         height / width, // Bottom
         0.0,            // Top
         -1.0,           // Near
         1.0);           // Far
glMatrixMode(GL_MODELVIEW);
What is the equivalent way to set this up in OpenGL ES 2.0?
What projection matrix should I pass to the vertex shader?
I have tried the following function to create the matrix, but it's not working:
void SetOrtho(Matrix4x4& m, float left, float right, float bottom, float top, float near, float far)
{
    const float tx = - (right + left)/(right - left);
    const float ty = - (top + bottom)/(top - bottom);
    const float tz = - (far + near)/(far - near);

    m.m[0] = 2.0f/(right-left);
    m.m[1] = 0;
    m.m[2] = 0;
    m.m[3] = tx;

    m.m[4] = 0;
    m.m[5] = 2.0f/(top-bottom);
    m.m[6] = 0;
    m.m[7] = ty;

    m.m[8] = 0;
    m.m[9] = 0;
    m.m[10] = -2.0f/(far-near);
    m.m[11] = tz;

    m.m[12] = 0;
    m.m[13] = 0;
    m.m[14] = 0;
    m.m[15] = 1;
}
Vertex Shader :
uniform mat4 u_mvpMatrix;
attribute vec4 a_position;
attribute vec4 a_color;
varying vec4 v_color;
void main()
{
gl_Position = u_mvpMatrix * a_position;
v_color = a_color;
}
Client code (parameters to the vertex shader):
float min = 0.0f;
float max = 1.0f;

const GLfloat squareVertices[] = {
    min, min,
    min, max,
    max, min,
    max, max
};

const GLfloat squareColors[] = {
    1, 1, 0, 1,
    0, 1, 1, 1,
    0, 0, 0, 1,
    1, 0, 1, 1,
};

Matrix4x4 proj;
SetOrtho(proj, 0.0f, 1.0f, 480.0/320.0, 0.0f, -1.0f, 1.0f );
The output I am getting in the iPhone simulator:
Your transcription of the glOrtho formula looks correct.
Your Matrix4x4 class is custom, but is it possible that m.m ends up being loaded directly via glUniformMatrix4fv? If so, check how you handle the layout: you're storing the data in row-major order, while OpenGL expects column-major (i.e., by the standard rules, index [0] is the top of the first column, [3] is the bottom of the first column, [4] is the top of the second column, etc.). On desktop GL you could pass GL_TRUE for the transpose flag, but OpenGL ES 2.0 requires it to be GL_FALSE, so transpose the matrix on the CPU before uploading it.
It's possibly also worth checking that, assuming you've directly replicated the old fixed-function matrix stacks, you're applying modelview and projection in the correct order in your vertex shader, or else compositing them correctly on the CPU, whichever way around you're doing it.
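As a sketch of that CPU-side transpose (assuming your row-major Matrix4x4 and a valid uniform location):

// Convert the row-major matrix to column-major before uploading;
// OpenGL ES 2.0 requires the transpose flag of glUniformMatrix4fv to be GL_FALSE.
void UploadMatrix(GLint location, const Matrix4x4& m)
{
    GLfloat colMajor[16];
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            colMajor[c * 4 + r] = m.m[r * 4 + c];
    glUniformMatrix4fv(location, 1, GL_FALSE, colMajor);
}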
