Optimizing vertices for skeletal animation in OpenGL ES

So I'm working with a 2D skeletal animation system.
There are X number of bones, and each bone has at least 1 part (a quad, two triangles). On average, I have maybe 20 bones and 30 parts. Most bones depend on a parent, and the bones move every frame. There are up to 1000 frames in total per animation, and I'm using about 50 animations, for a total of around 50,000 frames loaded in memory at any one time. The parts differ between instances of the skeleton.
The first approach I took was to calculate the position/rotation of each bone, and build up a vertex array, which consisted of this, for each part:
[x1,y1,u1,v1],[x2,y2,u2,v2],[x3,y3,u3,v3],[x4,y4,u4,v4]
And pass this through to glDrawElements each frame.
This looks fine, covers all the scenarios that I need, and doesn't use much memory, but performs like a dog. On an iPod 4, I could get maybe 15fps with 10 of these skeletons being rendered.
I worked out that most of the performance was being eaten up by copying so much vertex data each frame. I decided to go to another extreme and "pre-calculated" the animations, building up a vertex buffer at the start for each character that contained the xyuv coordinates for every frame, for every part, in a single character. Then I calculate the index of the frame that should be used for a particular time, and calculate a delta value, which is passed through to the shader and used to interpolate between the current and the next frame's XY positions.
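The frame lookup described above can be sketched on the CPU side roughly as follows. This is only an illustration under assumptions: the names `FrameSample` and `sampleAnimation` are hypothetical, and the animation is assumed to be baked at a fixed frame rate (the shader below uses a per-bone alpha instead, so the real system is evidently more fine-grained).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: which baked frame to draw and how far to blend
// towards the next one.
struct FrameSample {
    int   frame;  // index of the frame at or before `time`
    int   next;   // index of the following frame (clamped to the last frame)
    float delta;  // blend factor in [0, 1), passed to the shader
};

// Assumes the animation was baked at a constant `fps` with `frameCount` frames.
inline FrameSample sampleAnimation(float time, float fps, int frameCount) {
    assert(frameCount > 0 && fps > 0.0f && time >= 0.0f);
    const float position = time * fps;
    const int frame = static_cast<int>(std::floor(position));
    if (frame >= frameCount - 1) {
        // Past the end: hold the last frame without blending.
        return {frameCount - 1, frameCount - 1, 0.0f};
    }
    return {frame, frame + 1, position - static_cast<float>(frame)};
}
```

The `frame` and `next` indices would select the offsets bound to `a_position` and `a_nextPosition`, and `delta` would be uploaded as the blend factor.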
The vertices looked like this, per frame
[--------------------- Frame 1 ---------------------],[------- Frame 2 ------]
[x1,y1,u1,v1,boneIndex],[x2, ...],[x3, ...],[x4, ...],[x1, ...][x2, ...][....]
The vertex shader looks like this:
attribute vec4 a_position;
attribute vec4 a_nextPosition;
attribute vec2 a_texCoords;
attribute float a_boneIndex;
uniform mat4 u_projectionViewMatrix;
uniform float u_boneAlpha[255];
varying vec2 v_texCoords;
void main() {
    float alpha = u_boneAlpha[int(a_boneIndex)];
    vec4 position = mix(a_position, a_nextPosition, alpha);
    gl_Position = u_projectionViewMatrix * position;
    v_texCoords = a_texCoords;
}
Now, performance is great, with 10 of these on screen, it sits comfortably at 50fps. But now, it uses a metric ton of memory. I've optimized that by losing some precision on xyuv, which are now ushorts.
There's also the problem that the bone-dependencies are lost. If there are two bones, a parent and child, and the child has a keyframe at 0s and 2s, the parent has a keyframe at 0s, 0.5s, 1.5s, 2s, then the child won't be changed between 0.5s and 1.5s as it should.
I came up with a solution to fix this bone problem -- by forcing the child to have keyframes at the same points as the parents. But this uses even more memory, and basically kills the point of the bone hierarchy.
This is where I'm at now. I'm trying to find a balance between performance and memory usage. I know there is a lot of redundant information here (UV coordinates are identical for all the frames of a particular part, so repeated ~30 times). And a new buffer has to be created for every set of parts (which have unique XYUV coordinates -- positions change because different parts are different sizes)
Right now I'm going to try setting up one vertex array per character, which has the xyuv for all parts, calculating the matrices for each part, and repositioning the parts in the shader. I know this will work, but I'm worried that the performance won't be any better than just uploading the XYUVs for each frame, as I was doing at the start.
Is there a better way to do this without losing the performance I've gained?
Are there any wild ideas I could try?

The better way to do this is to transform your 30 parts on the fly, not make thousands of copies of your parts in different positions. Your vertex buffer will contain one copy of your vertex data, saving tons of memory. Then each frame can be represented by a set of transformations passed as a uniform to your vertex shader for each bone you draw with a call to glDrawElements(). Each dependent bone's transformation is built relative to the parent bone. Then, depending on where on the continuum between hand crafted and procedurally generated you want your animations, your sets of transforms can take more or less space and CPU computing time.
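As a rough CPU-side sketch of "each dependent bone's transformation is built relative to the parent bone", the per-frame work might look like the following. All names here (`Affine2D`, `Bone`, `computeWorldTransforms`) are hypothetical, and the sketch assumes bones are stored so that every parent comes before its children.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// 2D affine transform, representing the matrix [a c tx; b d ty; 0 0 1].
struct Affine2D {
    float a, b, c, d, tx, ty;
};

// Rotation by `angleRadians` followed by a translation to (tx, ty).
inline Affine2D makeTransform(float angleRadians, float tx, float ty) {
    const float sn = std::sin(angleRadians), cs = std::cos(angleRadians);
    return {cs, sn, -sn, cs, tx, ty};
}

// Returns p * q, i.e. apply q first, then p.
inline Affine2D combine(const Affine2D& p, const Affine2D& q) {
    return {
        p.a * q.a + p.c * q.b,
        p.b * q.a + p.d * q.b,
        p.a * q.c + p.c * q.d,
        p.b * q.c + p.d * q.d,
        p.a * q.tx + p.c * q.ty + p.tx,
        p.b * q.tx + p.d * q.ty + p.ty,
    };
}

struct Bone {
    int parent;      // index of the parent bone, or -1 for a root bone
    Affine2D local;  // this frame's transform relative to the parent
};

// Assumes bones are ordered so that every parent precedes its children.
inline std::vector<Affine2D> computeWorldTransforms(const std::vector<Bone>& bones) {
    std::vector<Affine2D> world(bones.size());
    for (std::size_t i = 0; i < bones.size(); ++i) {
        const int parent = bones[i].parent;
        assert(parent < static_cast<int>(i));
        world[i] = parent < 0 ? bones[i].local
                              : combine(world[parent], bones[i].local);
    }
    return world;
}
```

With around 20 bones this is a few dozen small multiplications per skeleton per frame. Each resulting transform would then be uploaded as a uniform before drawing that bone's parts, while the vertex buffer itself stays untouched.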
Jason L. McKesson's free book, Learning Modern 3D Graphics Programming, gives a good explanation on how to accomplish this in chapter 6. The example program at the end of this chapter shows how to use a matrix stack to implement a hierarchical model. I have an OpenGL ES 2.0 on iOS port of this program available.

Related

Summed area table in GLSL and GPU fragment shader execution

I am trying to compute the integral image (aka summed area table) of a texture I have in the GPU memory (a camera capture), the goal being to compute the adaptive threshold of said image. I'm using OpenGL ES 2.0, and still learning :).
I did a test with a simple gaussian blur shader (vertical/horizontal pass), which is working fine, but I need a way bigger variable average area for it to give satisfactory results.
I did implement a version of that algorithm on CPU before, but I'm a bit confused on how to implement that on a GPU.
I tried to do a (completely incorrect) test with just something like this for every fragment :
#version 100
#extension GL_OES_EGL_image_external : require
precision highp float;
uniform sampler2D u_Texture; // The input texture.
varying lowp vec2 v_TexCoordinate; // Interpolated texture coordinate per fragment.
uniform vec2 u_PixelDelta; // Pixel delta
void main()
{
    // get neighboring pixels values
    float center = texture2D(u_Texture, v_TexCoordinate).r;
    float a = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, 0.0)).r;
    float b = texture2D(u_Texture, v_TexCoordinate + vec2(0.0, u_PixelDelta.y * 1.0)).r;
    float c = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, u_PixelDelta.y * 1.0)).r;
    // compute value
    float pixValue = center + a + b - c;
    // Result stores value (R) and original gray value (G)
    gl_FragColor = vec4(pixValue, center, center, 1.0);
}
And then another shader to get the area that I want and then get the average. This is obviously wrong as there's multiple execution units operating at the same time.
I know that the common way of computing a prefix sum on a GPU is to do it in two passes (vertical/horizontal, as discussed in this thread or here), but isn't there a problem here, as each cell has a data dependency on the previous (top or left) one?
I can't seem to understand the order in which the multiple execution units on a GPU will process the different fragments, and how a two-pass filter can solve that issue. As an example, if I have some values like this :
2 1 5
0 3 2
4 4 7
The two pass should give (first columns then rows):
2 1  5      2  3  8
2 4  7  ->  2  6 13
6 8 14      6 14 28
How can I be sure that, as an example, the value [0;2] will be computed as 6 (2 + 4) and not 4 (0 + 4, if the 0 hasn't been computed yet) ?
Also, as I understand that fragments are not pixels (If I'm not mistaken), would the values I store back in one of my texture in the first pass be the same in another pass if I use the exact same coordinates passed from the vertex shader, or will they be interpolated in some way ?
Tommy and Bartvbl address your questions about a summed-area table, but your core problem of an adaptive threshold may not need that.
As part of my open source GPUImage framework, I've done some experimentation with optimizing blurs over large radii using OpenGL ES. Generally, increasing blur radii leads to a significant increase in texture sampling and calculations per pixel, with an accompanying slowdown.
However, I found that for most blur operations you can apply a surprisingly effective optimization to cap the number of blur samples. If you downsample the image before blurring, blur at a smaller pixel radius (radius / downsampling factor), and then linearly upsample, you can arrive at a blurred image that is the equivalent of one blurred at a much larger pixel radius. In my tests, these downsampled, blurred, and then upsampled images look almost identical to the ones blurred based on the original image resolution. In fact, precision limits can lead to larger-radii blurs done at a native resolution breaking down in image quality past a certain size, where the downsampled ones maintain the proper image quality.
By adjusting the downsampling factor to keep the downsampled blur radius constant, you can achieve near constant-time blurring speeds in the face of increasing blur radii. For an adaptive threshold, the image quality should be good enough to use for your comparisons.
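To make the "radius / downsampling factor" relationship concrete, here is a minimal sketch of how a downsampling factor could be picked. The names `BlurPlan`, `planBlur` and the `maxRadius` cap are assumptions for illustration and are not taken from the framework mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct BlurPlan {
    int   downsample;     // factor to shrink the image by before blurring
    float reducedRadius;  // blur radius to use on the shrunken image
};

// `maxRadius` is a hypothetical cap on the radius actually sampled per pass.
inline BlurPlan planBlur(float radius, float maxRadius) {
    assert(radius > 0.0f && maxRadius > 0.0f);
    const int factor = std::max(1, static_cast<int>(std::ceil(radius / maxRadius)));
    return {factor, radius / static_cast<float>(factor)};
}
```

For example, a requested radius of 32 with a cap of 4 gives a factor of 8 and a reduced radius of 4, so the number of texture samples per pixel stays the same as for a radius-4 blur at full resolution.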
I use this approach in the Gaussian and box blurs within the latest version of the above-linked framework, so if you're running on Mac, iOS, or Linux, you can evaluate the results by trying out one of the sample applications. I have an adaptive threshold operation based on a box blur that uses this optimization, so you can see if the results there are what you want.
As per the above, it's not going to be fantastic on a GPU. But assuming the cost of shunting data between the GPU and CPU is more troubling, it may still be worth persevering.
The most obvious prima facie solution is to split horizontal/vertical as discussed. Use an additive blending mode and create a quad that draws the whole source image; then, e.g. for the horizontal step on a bitmap of width n, issue a call that requests the quad be drawn n times, the 0th time at x = 0, the mth time at x = m. Then ping pong via an FBO, switching the target buffer of the horizontal draw into the source texture for the vertical.
Memory accesses are probably O(n^2) (i.e. you'll probably cache quite well, but that's hardly a complete relief) so it's a fairly poor solution. You could improve it by divide and conquer by doing the same thing in bands — e.g. for the vertical step, independently sum individual rows of 8, after which the error in every row below the final is the failure to include whatever the sums are on that row. So perform a second pass to propagate those.
However an issue with accumulating in the frame buffer is clamping to avoid overflow — if you're expecting a value greater than 255 anywhere in the integral image then you're out of luck because the additive blending will clamp and GL_RG32I et al don't reach ES prior to 3.0.
The best solution I can think of to that, without using any vendor-specific extensions, is to split up the bits of your source image and combine channels after the fact. Supposing your source image were 4 bit and your image less than 256 pixels in both directions, you'd put one bit each in the R, G, B and A channels, perform the normal additive step, then run a quick recombine shader as value = A + (B*2) + (G*4) + (R*8). If your texture is larger or smaller in size or bit depth then scale up or down accordingly.
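The split-and-recombine idea can be checked on the CPU with a small sketch like the one below. It only simulates a single row of 4-bit values, and the names (`Rgba`, `splitBits`, `prefixSumRow`, `recombine`) are hypothetical; on the GPU the running sum would come from additive blending rather than a loop.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Rgba = std::array<std::uint8_t, 4>;  // channel order: R, G, B, A

// Put one bit of a 4-bit value in each channel, most significant bit in R.
inline Rgba splitBits(std::uint8_t value) {
    assert(value < 16);
    return {static_cast<std::uint8_t>((value >> 3) & 1),
            static_cast<std::uint8_t>((value >> 2) & 1),
            static_cast<std::uint8_t>((value >> 1) & 1),
            static_cast<std::uint8_t>(value & 1)};
}

// Running sum along one row, channel by channel, standing in for what
// additive blending would accumulate in an 8-bit framebuffer.
inline std::vector<Rgba> prefixSumRow(const std::vector<std::uint8_t>& row) {
    assert(row.size() < 256);  // keeps every channel total within 8 bits
    std::vector<Rgba> sums;
    Rgba running = {0, 0, 0, 0};
    for (std::uint8_t value : row) {
        const Rgba bits = splitBits(value);
        for (int ch = 0; ch < 4; ++ch) {
            running[ch] = static_cast<std::uint8_t>(running[ch] + bits[ch]);
        }
        sums.push_back(running);
    }
    return sums;
}

// value = A + (B*2) + (G*4) + (R*8), as in the recombine shader described above.
inline int recombine(const Rgba& s) {
    return s[3] + s[2] * 2 + s[1] * 4 + s[0] * 8;
}
```

Because each channel only ever accumulates 0 or 1 per pixel, no channel can exceed the row length, which is why the clamp at 255 is never hit for rows shorter than 256 pixels.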
(platform specific observation: if you're on iOS then you've hopefully already got a CVOpenGLESTextureCache in the loop, which means you have CPU and GPU access to the same texture store, so you might well prefer to kick this step off to GCD. iOS is amongst the platforms supporting EXT_shader_framebuffer_fetch; if you have access to that then you can write any old blend function you like and at least ditch the combination step. Also you're guaranteed that preceding geometry has completed before you draw so if each strip writes its totals where it should and also to the line below then you can perform the ideal two-pixel-strips solution with no intermediate buffers or state changes)
What you attempt to do cannot be done in a fragment shader. GPUs are by nature very different from CPUs in that they execute their instructions in parallel, in massive numbers at the same time. Because of this, OpenGL does not make any guarantees about execution order, because the hardware physically doesn't allow it to.
So there is not really any defined order other than "whatever the GPU thread block scheduler decides".
Fragments are pixels, sorta-kinda. They are pixels that potentially end up on screen. If another triangle ends up in front of another, the previous calculated colour value is discarded. This happens regardless of whatever colour was stored at that pixel in the colour buffer previously.
As for creating the summed area table on the GPU, I think you may first want to look at GLSL "Compute Shaders", which are specifically made for this sort of thing.
I think you may be able to get this to work by creating a single thread for each row of pixels in the table, then have every thread "lag behind" by 1 pixel compared to the previous row.
In pseudocode:
int row_id = thread_id()
for column_index in (image.cols + image.rows):
    int my_current_column_id = column_index - row_id
    if my_current_column_id >= 0 and my_current_column_id < image.cols:
        // calculate sums
The catch of this method is that all threads should be guaranteed to execute their instructions simultaneously without getting ahead of one another. This is guaranteed in CUDA, but I'm not sure whether it is in OpenGL compute shaders. It may be a starting point for you, though.
It may look surprising to a beginner, but the prefix sum or SAT calculation is well suited to parallelization. The Hensley algorithm is the most intuitive to understand (and has been implemented in OpenGL), but more work-efficient parallel methods are available; see CUDA scan. The paper by Sengupta discusses a parallel method with reduce and down-sweep phases, which seems to be the state of the art in efficiency. These are valuable materials, but they do not go into OpenGL shader implementations in detail. The closest document is the presentation you have found (it refers to the Hensley publication), since it has some shader snippets. This job is doable entirely in a fragment shader with FBO ping-pong. Note that the FBO and its texture need a high-precision internal format. GL_RGB32F would be best, but I am not sure if it is supported in OpenGL ES 2.0.
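To see why ping-pong passes sidestep the execution-order worry, here is a CPU-side sketch of a recursive-doubling scan, which is my understanding of the general scheme behind the Hensley approach; treat it as an illustration rather than a transcription of that paper. The names `Image`, `scanPass` and `summedAreaTable` are hypothetical. Each call to `scanPass` stands for one render pass: it reads only from the source image and writes to a separate destination, so the order in which texels are processed cannot affect the result.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using Image = std::vector<std::vector<float>>;  // image[row][col]

// One simulated render pass. Every output texel is computed from `src` only,
// never from `dst`, which mirrors reading one texture while rendering to another.
inline Image scanPass(const Image& src, std::size_t offset, bool vertical) {
    Image dst = src;
    for (std::size_t r = 0; r < src.size(); ++r) {
        for (std::size_t c = 0; c < src[r].size(); ++c) {
            if (vertical && r >= offset) {
                dst[r][c] = src[r][c] + src[r - offset][c];
            } else if (!vertical && c >= offset) {
                dst[r][c] = src[r][c] + src[r][c - offset];
            }
        }
    }
    return dst;
}

// Inclusive summed-area table: about log2(rows) vertical passes followed by
// about log2(cols) horizontal passes, doubling the offset each time.
inline Image summedAreaTable(Image image) {
    assert(!image.empty() && !image[0].empty());
    const std::size_t rows = image.size(), cols = image[0].size();
    for (std::size_t offset = 1; offset < rows; offset *= 2) {
        image = scanPass(image, offset, true);
    }
    for (std::size_t offset = 1; offset < cols; offset *= 2) {
        image = scanPass(image, offset, false);
    }
    return image;
}
```

Run on the 3x3 example from the question, this reproduces the expected table, including 6 at the bottom-left cell, because that cell's sum is only completed in a later pass, after the pass that produced its inputs has fully finished.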

Opengl ES 2.0: Model Matrix vs Per Vertex Calculation

I may be asking a silly question, but I'm a bit curious about OpenGL ES 2.0 performance.
Let's say I have a drawing object that contains a Vertex Array "VA", a Buffer Array "BA", and/or a Model Matrix "MM", and I want to do at least one Translation and one Rotation per frame. So, what is the best alternative?
Do the operations (Rot and Trans) on VA and pass to BA.
Do the operations (Rot and Trans) directly on BA.
Do the operations on MM and pass it to the OpenGL Vertex Shader.
My concern is about performance, the processing/memory ratio. I think that the 3rd option may be the best because of the GPU, but also the most expensive in terms of memory because every object would have to have a MM, right?
Another solution that I thought of was to pass the translation and rotation parameters to the shaders and assemble the MM in the shader.
How is this best done?
It is far from a silly question but unfortunately it all depends on the case. Generally even using the vertex buffers on the GPU might not be the best idea if the vertex data is constantly changing but I guess this is not the case you are having.
So the two main differences in what you are thinking would be:
Modify each of the vertex in the CPU and then send the vertex data to the GPU.
Leaving the data on the GPU as it is and change them in the vertex shader by using a matrix.
So the first option is actually good if the vertex data are changing beyond what you can represent with a matrix or any other kind of analytically expressed vertex transformation, for instance if you kept generating random positions on the CPU. In such cases there is actually little sense in even using a vertex buffer, since you will need to keep streaming the vertex data every frame anyway.
The second one is great in cases where the base vertex data are relatively static (not changing too much on every frame). You push the vertex data to the GPU once (or once every now and then) and then use the vertex shader to transform the vertex data for you. The vertex shader on the GPU is very effective at doing so and will be much faster than applying the same algorithm on the CPU.
So about your questions:
The third option would most likely be the best if you have a significant amount of vertex data, but I wouldn't say it is expensive in terms of memory: a matrix consists of 16 floats, which is relatively small, since 6 3D vertex positions would take more memory than that, so you should not worry about that at all. If anything you should worry about how much data you stream to the GPU, which again is the least with this option.
To pass a translation and rotation to the vertex shader and then compose the matrix for every vertex is probably not the best idea. What happens here is that you gain a little in traffic to the GPU by sending 4+3 floats instead of 16 floats, but to begin with you send it in two chunks, which can produce an overhead. Next to that you consume more memory rather than less, since you need to create the matrix in the shader anyway. And if you do that you will be computing a new matrix for every vertex shader invocation, which means for each and every vertex.
Now about these matrices and the memory it is hard to say it will actually have any influence on the memory itself. The stack size is usually fixed or at least rounded so adding a matrix into the shader or not will most likely have no difference in any memory consumption at all.
When it comes to openGL and performance you primarily need to watch for:
Memory consumption. This is mostly taken with textures, a 1024x1024 RGBA will take about 4MB which equals to a million floats or about 350k vertices containing a 3D position vectors so something like a matrix really has little effect.
Data stream. This is how much data you need to pass to the GPU on every frame for processing. This should be reduced as much as possible but again sending up to a few MB should not be a problem at all.
Overall efficiency in the shader
Number of draw calls. If possible try to pack as much similar data as possible to reduce the draw calls.

Why is a Sprite Batcher faster?

I am reading Beginning Android Games (Mario Zechner) at the moment.
While reading about 2D games with OpenGL ES 1.0 the author introduces the concept of the SpriteBatcher that takes for each sprite it shall render the coordinates and an angle. The SpriteBatcher then calculates the final coordinates of the sprite rectangle and puts that into a single big buffer.
In the render method the SpriteBatcher sets the state for all the sprites once (texture, blending, vertex buffer, texture coordinates buffer). All sprites use the same texture but not the same texture coordinates.
The advantages of this behavior are:
The rendering pipeline does not stall, since there are no state changes while rendering all the sprites.
There are fewer OpenGL calls. (= less JNI overhead)
But I see a major disadvantage:
For rotation the CPU has to calculate the sine and cosine and perform 16 multiplication for each sprite. As far as I know calculating sine and cosine is very expensive and slow.
But the SpriteBatcher approach is lots faster than using lots of glRotate/glTranslate for rendering the sprites one by one.
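For reference, the CPU-side work of such a batcher can be sketched as below. This is not the book's code; the `SpriteBatch` class and its layout (x, y, u, v per corner) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical batcher: collects x, y, u, v for four corners per sprite
// into one contiguous array that can be submitted with a single draw call.
class SpriteBatch {
public:
    // Appends one sprite rotated by `angleRadians` around its centre (cx, cy).
    // (u0, v0)-(u1, v1) is the sprite's region in the shared texture.
    void add(float cx, float cy, float width, float height, float angleRadians,
             float u0, float v0, float u1, float v1) {
        assert(width > 0.0f && height > 0.0f);
        const float cs = std::cos(angleRadians), sn = std::sin(angleRadians);
        const float hw = width * 0.5f, hh = height * 0.5f;
        const float corners[4][2] = {{-hw, -hh}, {hw, -hh}, {hw, hh}, {-hw, hh}};
        const float uvs[4][2] = {{u0, v1}, {u1, v1}, {u1, v0}, {u0, v0}};
        for (int i = 0; i < 4; ++i) {
            // Four multiplications per corner, sixteen per sprite.
            vertices_.push_back(cx + corners[i][0] * cs - corners[i][1] * sn);
            vertices_.push_back(cy + corners[i][0] * sn + corners[i][1] * cs);
            vertices_.push_back(uvs[i][0]);
            vertices_.push_back(uvs[i][1]);
        }
        ++spriteCount_;
    }

    const std::vector<float>& vertices() const { return vertices_; }
    int spriteCount() const { return spriteCount_; }

private:
    std::vector<float> vertices_;
    int spriteCount_ = 0;
};
```

The whole `vertices()` array would then be uploaded and drawn once per frame, so the per-sprite cost is one sine, one cosine and sixteen multiplications on the CPU instead of several OpenGL calls.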
Finally my questions:
Why is it faster? Are OpenGL state changes really that expensive?
The GPU is optimized for vector multiplications and rotations, while the CPU is not. Why doesn't that matter?
Would one use a SpriteBatcher on a desktop with a dedicated GFX-card?
Is there a point where the SpriteBatcher becomes inefficient?
But I see a major disadvantage:
For rotation the CPU has to calculate the sine and cosine and perform 16 multiplication for each sprite. As far as I know calculating sine and cosine is very expensive and slow.
Actually sin and cos are quite fast, on modern architectures they take 1 clock cycle to execute, if the pipeline has not been stalled before. However, if each sprite is rotated individually and an ordinary frustum perspective projection is used, the author of this code doesn't know his linear algebra.
The whole task can be simplified a lot if one recalls that the modelview matrix linearly maps local/world coordinates to eye space. The rotation is in the upper left 3×3 submatrix, the columns forming the local base vectors. By taking the inverse of this submatrix you're given exactly those vectors you need as the sprite base, to map the plane into eye space. In case only rotations (and scaling, maybe) are applied, the inverse of the upper left 3×3 is the transpose; so by using the rows of the upper left 3×3 as the sprite base you get that effect without doing any trigonometry at all:
/* populates the currently bound VBO with sprite geometry;
   vec3 is assumed to be a user type with x, y, z and scale members */
void populate_sprites_VBO(const std::vector<vec3> &sprite_positions)
{
    GLfloat mv[16];
    GLfloat sprite_left[3];
    GLfloat sprite_up[3];
    /* the matrix is stored column-major, so row 0 of the upper left 3×3
       is mv[0], mv[4], mv[8] and row 1 is mv[1], mv[5], mv[9] */
    glGetFloatv(GL_MODELVIEW_MATRIX, mv);
    for(int i=0; i<3; i++) {
        sprite_left[i] = mv[i*4];
        sprite_up[i]   = mv[i*4 + 1];
    }
    std::vector<GLfloat> sprite_geom;
    sprite_geom.reserve(sprite_positions.size() * 12);
    for(std::vector<vec3>::const_iterator sprite=sprite_positions.begin(), end=sprite_positions.end();
        sprite != end;
        ++sprite ){
        sprite_geom.push_back(sprite->x + (-sprite_left[0] - sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + (-sprite_left[1] - sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + (-sprite_left[2] - sprite_up[2])*sprite->scale);
        sprite_geom.push_back(sprite->x + ( sprite_left[0] - sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + ( sprite_left[1] - sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + ( sprite_left[2] - sprite_up[2])*sprite->scale);
        sprite_geom.push_back(sprite->x + ( sprite_left[0] + sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + ( sprite_left[1] + sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + ( sprite_left[2] + sprite_up[2])*sprite->scale);
        sprite_geom.push_back(sprite->x + (-sprite_left[0] + sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + (-sprite_left[1] + sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + (-sprite_left[2] + sprite_up[2])*sprite->scale);
    }
    /* upload the generated geometry, not the input positions */
    glBufferData(GL_ARRAY_BUFFER,
                 sprite_geom.size() * sizeof(GLfloat), sprite_geom.data(),
                 GL_STREAM_DRAW);
}
If shaders are available, then instead of rebuilding the sprite data on CPU each frame, one could use the geometry shader or the vertex shader. A geometry shader would take a vector of position, scale, texture, etc. and emit the quads. Using a vertex shader you'd send a lot of [-1,1] quads, where each vertex would carry the center position of the sprite it belongs to as an additional vec3 attribute.
Finally my questions:
Why is it faster? Are OpenGL state changes really that expensive?
Some state changes are extremely expensive, you'll try to avoid those, wherever possible. Switching textures is very expensive, switching shaders is mildly expensive.
The GPU is optimized for vector multiplications and rotations, while the CPU is not. Why doesn't that matter?
This is not the difference between GPU and CPU. Where a GPU differs from a CPU is, that it performs the same sequence of operations on a huge chunk of records in parallel (each pixel of the framebuffer rendered to). A CPU on the other hand runs the program one record at a time.
But CPUs do vector operations just as well, if not even better than GPUs. Especially where precision matters CPUs are still preferred over GPUs. MMX, SSE and 3DNow! are vector math instruction sets.
Would one use a SpriteBatcher on a desktop with a dedicated GFX-card?
Probably not in this form, since today one has geometry and vertex shaders available, liberating the CPU for other things. But more importantly this saves bandwidth between CPU and GPU. Bandwidth is the tighter bottleneck, processing power is not the number one problem these days (of course one never has enough processing power).
Is there a point where the SpriteBatcher becomes inefficient?
Yes, namely the CPU → GPU transfer bottleneck. Today one uses geometry shaders and instancing to do this kind of thing, really fast.
I don't know about SpriteBatcher, but looking at the information you provided here are my thoughts:
It is faster because it uses fewer state changes and, more importantly, fewer draw calls. Mobile platforms have especially strict constraints on the number of draw calls per frame.
That doesn't matter because, probably, they are using CPU for rotations. I, personally, see no reason not to use GPU for that, which would be way faster and nullify bandwidth load.
I guess it would still be a good optimization considering point 1.
I can think of two extreme cases: when there are too few sprites, or when the compound texture (containing all rotated sprites) grows too big (mobile devices have lower size limits).

For an arbitrary number of transformations in OpenGL ES 2.0, where do you calculate model and view matrices?

I'm writing a small 2D game engine in OpenGL ES 2.0. It works, but for medium sized scenes it feels a little sluggish currently. I designed it so that every game object is a tree of nodes, and each node is a primitive shape (triangle, square, circle). And every node can have an arbitrary set of transformations applied to it at creation and also at runtime.
To illustrate, a "head" node is a circle, and it has a child "hat" node that is a triangle with a translation transform to move it to the top of the circle. Now, at runtime, I can move the head around with an animated translation transformation on the head, and the hat moves with it. Or I can animate a "hat tip" by applying a rotation transformation just on the hat, dynamically at runtime.
On render, every node applies its own static transformations (the hat moving up), then any dynamic translations (the hat tip), and then so on for every parent node. There are three matrices per node plus another three for each applied dynamic animation. For deep trees, this adds up to a lot of matrix math.
This seems like a good thing to push to the GPU if possible, but since animations are applied dynamically I don't know ahead of time how many transforms each node will undergo in order to write a shader to handle it. I'm new to OpenGL ES 2.0 and game engine design both and don't know all limitations.
My questions are...
Am I radically out of line with "good" game engine design?
Is this indeed a task for the CPU or the GPU?
Can an OpenGL ES 2.0 shader be written to handle an arbitrary number of transformations that conform to my "object tree" design and run-time applied animation matrices?
Moving the transformation hierarchy calculations to the GPU is a bad idea. Shaders operate on a per-primitive/per-vertex/per-fragment level. So you'll carry out those calculations for each and every vertex you draw. Not very efficient.
You should really optimize the way you're doing your animations. For example, you don't need 3 matrices per node. One matrix contains the whole transformation. Every 4×4 matrix-matrix multiplication involves 64 floating point multiplications. So combining three matrices per node already costs 2 × 64 = 128 multiplications, before any animation matrices are added. Cut that out!
A good way to optimize the animation system is to keep the individual parameters separate. Use quaternions for the rotation (quaternions take only 8 scalar multiplications), and store the translation as a 3-vector, and the same for scaling. Then compose the single transformation matrix from those parts. You can translate a quaternion directly into the 3×3 upper left part, describing the rotation, and use the scaling vector as factors on the columns. The translation goes into the 4th column (elements 12 to 14 in OpenGL's column-major layout). Element 4,4 is 1.
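A minimal sketch of composing one matrix from those separate parts might look like this. The type and function names (`Quat`, `Vec3`, `composeTransform`) are hypothetical, the quaternion is assumed to be unit length, and the result is laid out column-major as OpenGL expects, with the translation in elements 12 to 14.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat4 = std::array<float, 16>;  // column-major: element (row, col) is at [col * 4 + row]

struct Quat { float w, x, y, z; };   // assumed to be unit length
struct Vec3 { float x, y, z; };

// Builds translation * rotation * scale as a single matrix.
inline Mat4 composeTransform(const Quat& q, const Vec3& t, const Vec3& s) {
    assert(std::fabs(q.w * q.w + q.x * q.x + q.y * q.y + q.z * q.z - 1.0f) < 1e-3f);
    const float xx = q.x * q.x, yy = q.y * q.y, zz = q.z * q.z;
    const float xy = q.x * q.y, xz = q.x * q.z, yz = q.y * q.z;
    const float wx = q.w * q.x, wy = q.w * q.y, wz = q.w * q.z;
    return {
        // column 0: first rotation axis, scaled by s.x
        (1.0f - 2.0f * (yy + zz)) * s.x, 2.0f * (xy + wz) * s.x, 2.0f * (xz - wy) * s.x, 0.0f,
        // column 1: second rotation axis, scaled by s.y
        2.0f * (xy - wz) * s.y, (1.0f - 2.0f * (xx + zz)) * s.y, 2.0f * (yz + wx) * s.y, 0.0f,
        // column 2: third rotation axis, scaled by s.z
        2.0f * (xz + wy) * s.z, 2.0f * (yz - wx) * s.z, (1.0f - 2.0f * (xx + yy)) * s.z, 0.0f,
        // column 3: translation, with element 4,4 equal to 1
        t.x, t.y, t.z, 1.0f,
    };
}
```

Animating a node then means interpolating the quaternion, translation and scale separately and calling something like `composeTransform` once per node per frame, instead of multiplying several full matrices together.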

Fastest way to take coordinates from model space to canonical coordinates space in OpenGL ES 2.0

Like many 3D graphical programs, I have a bunch of objects that have their own model coordinates (from -1 to 1 in the x, y, and z axes). Then, I have a matrix that takes it from model coordinates to world coordinates (using the location, rotation, and scale of the object being drawn). Finally, I have a second matrix to turn those world coordinates into canonical coordinates that OpenGL ES 2.0 will use to draw to the screen.
So, because one object can contain many vertices, all of which use the same transform into both world space, and canonical coordinates, it's faster to calculate the product of those two matrices once, and put each vertex through the resulting matrix, rather than putting each vertex through both matrices.
But, as far as I can tell, there doesn't seem to be a way in OpenGL ES 2.0 shaders to have it calculate the matrix once and keep using it until one of the two matrices is changed by a call to glUniformMatrix4fv() (or another function that sets a uniform). So it seems like the only way to calculate the matrix once would be to do it on the CPU, and then send the result to the GPU using a uniform. Otherwise, with something like:
gl_Position = uProjection * uMV * aPosition;
it will calculate it over and over again, which seems like it would waste time.
So, which way is usually considered standard? Or is there a different way that I am completely missing? As far as I could tell, the shader used to implement the OpenGL ES 1.1 pipeline in the OpenGL ES 2.0 Programming Guide only used one matrix, so is that used more?
First, the correct OpenGL term for "canonical coordinates" is clip space.
Second, it should be this:
gl_Position = uProjection * (uMV * aPosition);
What you posted does a matrix/matrix multiply followed by a matrix/vector multiply. This version does 2 matrix/vector multiplies. That's a substantial difference.
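The difference can be made concrete by counting scalar multiplications on the CPU. The sketch below uses hypothetical helpers (`mulMatMat`, `mulMatVec`) with an explicit counter; it says nothing about what a particular GLSL compiler will actually emit, only what the two groupings cost when evaluated literally.

```cpp
#include <array>
#include <cassert>

using Mat4 = std::array<float, 16>;  // column-major: element (row, col) is at [col * 4 + row]
using Vec4 = std::array<float, 4>;

// Matrix * vector; adds the number of scalar multiplications to `count`.
inline Vec4 mulMatVec(const Mat4& m, const Vec4& v, int& count) {
    Vec4 r = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            r[row] += m[col * 4 + row] * v[col];
            ++count;
        }
    }
    return r;
}

// Matrix * matrix; adds the number of scalar multiplications to `count`.
inline Mat4 mulMatMat(const Mat4& a, const Mat4& b, int& count) {
    Mat4 r = {};
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            for (int k = 0; k < 4; ++k) {
                r[col * 4 + row] += a[k * 4 + row] * b[col * 4 + k];
                ++count;
            }
        }
    }
    return r;
}
```

Evaluating `(uProjection * uMV) * aPosition` this way takes 64 + 16 = 80 multiplications per vertex, while `uProjection * (uMV * aPosition)` takes 16 + 16 = 32 and yields the same position.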
You're using shader-based hardware; how you handle matrices is up to you. There is nothing that is "considered standard"; you do what you best need to do.
That being said, unless you are doing lighting in model space, you will often need some intermediary between model space and 4D homogeneous clip-space. This is the space you transform the positions and normals into in order to compute the light direction, dot(N, L), and so forth.
Personally, I wouldn't suggest world space for reasons that I explain thoroughly here. But whether it's world space, camera space, or something else, you will generally have some intermediate space that you need positions to be in. At which point, the above code becomes necessary, and thus there is no time wasted.

Resources