How to measure GPU GFLOPS on a non-unified shader system? - performance

My question arises because some old GPUs have separate vertex shaders and pixel shaders, and I don't know how to measure GFLOPS on that kind of GPU.
I know you can estimate GFLOPS as Core Speed x ALUs x 2 (I don't know what this "2" is, if someone can answer that too it would be great!). But how do I compute it for a GPU that doesn't have unified shaders?
Thanks in advance.

I think it can be measured independently. That is, measure vertex shader performance with no pixel shader work (for example, draw back-face-culled triangles), then measure pixel shader performance with no (significant) vertex shader work (for example, full-screen triangles), and add the two numbers together.
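As for the "2": each ALU is counted as retiring one multiply-add (MAD) per clock, and a MAD counts as two floating-point operations. For a non-unified GPU, the theoretical peak is just the sum of the vertex and pixel blocks. With made-up figures of a 500 MHz core, 8 vertex ALUs and 20 pixel ALUs:

GFLOPS_peak ≈ clock × 2 × (vertex ALUs + pixel ALUs)
            = 0.5 GHz × 2 × (8 + 20)
            = 28 GFLOPS

Bear in mind that vendors count ALUs differently (a vec4 unit may be counted as one ALU or as four), and that this is a theoretical peak, not a measured number; the two-pass measurement above tells you what the chip actually achieves.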

Related

rendering millions of voxels using 3D textures with three.js

I am using three.js to render a voxel representation as a set of triangles. I have got it to render 5 million triangles comfortably, but that seems to be the limit. You can view it online here.
Select the Dublin model at resolution 3 to see a lot of triangles being drawn.
I have used every trick to get it this far (buffer geometry, voxel culling, multiple buffers), but I think it has hit the limit of what plain OpenGL triangle rendering can accomplish.
Large amounts of voxels are normally rendered as a set of images in a 3D texture, and while there are several posts on how to hack 2D textures into 3D textures, they seem to have a hard limit on the texture size.
I have searched for tutorials or examples using this approach but haven't found any. Has anyone used this approach before with three.js?
Your scene is rendered twice, because SSAO needs a depth texture. You could use the WEBGL_depth_texture extension - which has pretty good support - so you need just a single render pass. You can still fall back to the low-performance double pass if the extension is unavailable.
Your voxels' material is double-sided. That may be on purpose, but it can create huge overdraw.
In your demo, you use a MeshPhongMaterial and directional lights. That is a needlessly complex material; your geometries don't have any normals, so you can't get any lighting from it anyway. Try a simpler unlit material.
Your goal is to render a huge number of vertices, so assuming the framerate is bound by the vertex shader:
Try a tool like https://github.com/GPUOpen-Tools/amd-tootle to preprocess your geometries, focusing on the vertex prefetch cache and the post-transform vertex cache.
Reduce the bandwidth used by your vertex buffers. Since your vertices are aligned on a "grid", you could store vertex positions as 3 shorts instead of 3 floats, halving your VBO size. You could use the same trick for normals, since all normals of cubes are axis-aligned (see the shader sketch after this list).
Generally, reduce the number of varyings the fragment shader needs.
If you need more attributes than just the vec3 position, use a single interleaved VBO instead of one per attribute.
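As a sketch of the quantized-position trick (the grid uniforms below are made up; on the JavaScript side you would upload the positions as an Int16Array buffer attribute), a standalone vertex shader just rescales grid coordinates back to world space:

attribute vec3 position;        // stored as shorts in the VBO, read as floats
uniform mat4 modelViewMatrix;   // per-object matrices, set by your renderer
uniform mat4 projectionMatrix;
uniform vec3 gridScale;         // made-up uniform: world-space size of one voxel cell
uniform vec3 gridOrigin;        // made-up uniform: world-space offset of the grid

void main()
{
    // Positions were quantized to integer grid coordinates on the CPU;
    // undoing that here costs almost nothing per vertex.
    vec3 worldPos = gridOrigin + position * gridScale;
    gl_Position = projectionMatrix * modelViewMatrix * vec4( worldPos, 1.0 );
}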

THREEjs: GLSL Vertex Position to CPU

This is more of a GPU to CPU question, but I'd like to ask THREEjs people if they have any insights. In summary:
I have a large mesh requiring GPU calculations.
After the user interacts with the mesh, the user may elect to export the mesh based on the vertex positions computed by the GLSL shader (using OBJExporter). These need to be world positions, not screen positions (imagine a 3D print of the model).
I realize that it is expensive to go from the GPU to the CPU, but can anyone suggest how to do it? I'm not asking for code, but broad ideas would be very helpful, as I haven't found much regarding WebGL-specific workflows (see here).
How does one get the world XYZ of each vertex in a shader in order to recreate the geometry from the CPU for export?
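One broad idea, sketched under my own assumptions (none of the names below come from the post): run an extra pass that writes each vertex's computed world position into a floating-point render target, one texel per vertex, then read the pixels back on the CPU with gl.readPixels and rebuild the geometry for OBJExporter. The fragment shader of that pass is trivial:

precision highp float;

// Made-up varying: the world-space position your existing vertex
// shader computes, passed through for this readback pass.
varying vec3 vWorldPosition;

void main()
{
    // Written to a float render target; the CPU reads it back with
    // gl.readPixels and reconstructs the mesh vertex by vertex.
    gl_FragColor = vec4( vWorldPosition, 1.0 );
}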

Bump map sprite casting shadows on itself

I've got a fairly simple implementation of normal map lighting working for 2D sprites in webgl (GLSL shaders) which I was able to adapt & optimize from an example. It uses just one directional light and works fine for my purposes. Sprites are rendered flat (2D), only the light direction and normals are 3D vectors. Vertex rotation only happens around the z axis, so it's fairly easy-peasy.
I was hoping to add a bump (height) map to cast shadows. There are 3D bump map shadow casting examples and papers available online, but they're more complex than I need and the math goes over my head; I haven't found an example or explanation of how one might do a simple 2D case.
My first inclination is as follows: for the current pixel in the fragment shader, trace back along the direction of the light and check the altitude of the neighbouring bump map pixel. If it's higher than the light direction vector at that point, then that pixel is in the shade. However, since "tall" pixels on the bump map may cast shadow across more than 1 pixel of distance, I'd have to keep testing pixel by pixel in that direction until I find one tall enough to cast a shadow (or reach the edge of the texture, or reach some arbitrary limit).
This doesn't sound very optimal, especially for larger textures. I've read that if statements in shaders aren't so fast. Is there a faster/better method?
What you are looking for is called parallax (occlusion) mapping.
It's a technique that does exactly what you described, and it can be understood as on-bumpmap ray tracing in tangent space.
Here are some articles:
nVidia - Per-Pixel displacement (w/ sphere tracing)
nVidia - Cone Tracing for PM
AMD - POM
The ways to optimize the search are similar to those in ordinary ray tracing and include sphere tracing, cone tracing, binary search and the like, instead of a constant stepping function.
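For comparison, the fixed-step search described in the question (the baseline those optimizations improve on) is only a few lines in the fragment shader; every name here is illustrative:

precision mediump float;

uniform sampler2D heightMap;  // made-up: the bump/height texture
uniform vec3 lightDir;        // made-up: normalized tangent-space direction toward the light
uniform float stepSize;       // made-up: march step in texture coordinates

// Returns 1.0 if the texel at uv sees the light, an attenuation factor otherwise.
float shadowFactor( vec2 uv )
{
    float h0 = texture2D( heightMap, uv ).r;
    float t = 0.0;
    for ( int i = 0; i < 16; i++ )  // constant bound keeps the loop legal in GLSL ES
    {
        t += stepSize;
        // Compare the ray's altitude at distance t against the height field there.
        if ( texture2D( heightMap, uv + lightDir.xy * t ).r > h0 + lightDir.z * t )
            return 0.5;  // occluded: attenuate rather than go hard black
    }
    return 1.0;  // nothing blocked the ray within the search range
}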
P. S. If you know the name of a rendering technique, it's generally a good idea to Google it with 'nVidia', 'crytek' or 'gpu' added in front of the name; that will show you much more relevant results.
Hope this helps.

performance - drawing many 2d circles in opengl

I am trying to draw large numbers of 2d circles for my 2d games in opengl. They are all the same size and have the same texture. Many of the sprites overlap. What would be the fastest way to do this?
an example of the kind of effect I'm making http://img805.imageshack.us/img805/6379/circles.png
(It should be noted that the black edges are just due to the expanding explosion of circles. It was filled in a moment after this screenshot was taken.)
At the moment I am using a pair of textured triangles to make each circle. I have transparency around the edges of the texture so as to make it look like a circle. Using blending for this proved to be very slow (and z culling was not possible as they were rendered as squares to the depth buffer). Instead I am not using blending but having my fragment shader discard any fragments with an alpha of 0. This works, however it means that early z is not possible (as fragments are discarded).
The speed is limited by the large amounts of overdraw and the gpu's fillrate. The order that the circles are drawn in doesn't really matter (provided it doesn't change between frames creating flicker) so I have been trying to ensure each pixel on the screen can only be written to once.
I attempted this by using the depth buffer. At the start of each frame it is cleared to 1.0f. Then when a circle is drawn, it changes that part of the depth buffer to 0.0f. When another circle would normally be drawn there, it is not, as the new circle also has a z of 0.0f, which is not less than the 0.0f currently in the depth buffer. This works and should reduce the number of pixels which have to be drawn. However, strangely, it isn't any faster. I have already asked a question about this behavior (opengl depth buffer slow when points have same depth) and the suggestion was that z culling was not being accelerated when using equal z values.
Instead I have to give all of my circles separate false z-values from 0 upwards. Then when I render using glDrawArrays and the default of GL_LESS we correctly get a speed boost due to z culling (although early z is not possible as fragments are discarded to make the circles possible). However this is not ideal as I've had to add in large amounts of z related code for a 2d game which simply shouldn't require it (and not passing z values if possible would be faster). This is however the fastest way I have currently found.
Finally, I have tried using the stencil buffer; here I used
glStencilFunc(GL_EQUAL, 0, 1);
glStencilOp(GL_KEEP, GL_INCR, GL_INCR);
where the stencil buffer is reset to 0 each frame. The idea is that after a pixel is drawn to for the first time, it is changed to be non-zero in the stencil buffer, so that pixel is not drawn to again, reducing the amount of overdraw. However, this has proved to be no faster than just drawing everything without the stencil buffer or the depth buffer.
What is the fastest way people have found to do what I am trying?
The fundamental problem is that you're fill limited: the GPU is unable to shade all the fragments you ask it to draw in the time you're expecting. The reason your depth-buffering trick isn't effective is that the most time-consuming part of processing is shading the fragments (either through your own fragment shader, or through the fixed-function shading engine), which occurs before the depth test. The same issue occurs with stencil: shading the pixel occurs before stenciling.
There are a few things that may help, but they depend on your hardware:
Render your sprites from front to back with depth buffering. Modern GPUs often try to determine whether a collection of fragments will be visible before sending them off to be shaded. Roughly speaking, the depth buffer (or a representation of it) is checked to see if the fragment that's about to be shaded will be visible, and if not, its processing is terminated at that point. This should help reduce the number of pixels that need to be written to the framebuffer.
Use a fragment shader that immediately checks your texel's alpha value, and discards the fragment before any additional processing, as in:
varying vec2 texCoord;
uniform sampler2D tex;

void main()
{
    vec4 texel = texture2D( tex, texCoord );
    // Reject nearly transparent texels before doing any further work
    if ( texel.a < 0.01 )
        discard;
    // rest of your color computations
}
(You can also use the alpha test in fixed-function fragment processing, but it's impossible to say whether the test will be applied before the completion of fragment shading.)

opengles 2.0 2D scene graph implementation

I want to create a 2D OpenGL ES engine to use for my apps. The engine should support a scene graph. Each node of the graph can have its own shader, texture, material, and a transformation matrix relative to its parent node. But I'm new to OpenGL ES 2.0, so I have some questions:
How does matrix multiplication work in OpenGL ES 2.0? Is it a good approach to draw the parent first, then its child nodes (will that give some optimization when multiplying modelview matrices)? Where does the matrix multiplication take place: should I do it on the CPU or on the GPU?
Is it a better approach to draw nodes grouped by the shader they use, then by texture and material? How should I implement scene graph transformations in this case (on the CPU or the GPU, and how)?
How does matrix multiplication work in OpenGL ES 2.0? Is it a good approach to draw the parent first, then its child nodes (will that give some optimization when multiplying modelview matrices)? Where does the matrix multiplication take place: should I do it on the CPU or on the GPU?
Except on some ancient SGI machines, transformation matrix multiplication has always taken place on the CPU. While it is possible to do matrix multiplication in a shader, this should not be used to implement a transformation hierarchy. You should use, or implement, a small linear algebra library. If it is tailored to the 4x4 homogeneous transformations used in 3D graphics, it can be implemented in under 1k lines of C code.
Instead of relying on the old OpenGL matrix stack, which has been deprecated and removed from later versions, for every level in the transformation hierarchy create a copy of the current matrix, apply the next transform to it, and supply the new matrix to the transformation uniform.
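On the GLSL side, each node's draw call then just consumes the pre-multiplied matrix through a uniform; a minimal vertex shader for that scheme (names are illustrative) is:

attribute vec3 a_position;
uniform mat4 u_modelView;    // made-up: this node's accumulated transform, multiplied on the CPU
uniform mat4 u_projection;

void main()
{
    gl_Position = u_projection * u_modelView * vec4( a_position, 1.0 );
}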
Is it a better approach to draw nodes grouped by the shader they use, then by texture and material? How should I implement scene graph transformations in this case (on the CPU or the GPU, and how)?
Usually one uses a two-phase approach to rendering. In the first phase you collect all the information about which objects to draw, together with their drawing inputs (shaders, textures, transformation matrices), and put it in a list. Then you sort that list to minimize the total cost of state changes.
The most expensive state change is switching textures, as this invalidates the caches. Swapping textures between texturing units has some cost, but it is still cheaper than switching shaders, which invalidates the state of the execution path predictor. Changing uniform data, however, is very cheap, so don't be anxious about switching uniforms often. If you can express material variation through uniforms, do it that way; but be aware that conditional code in shaders has a performance cost of its own, so you must balance the cost of switching shaders against the cost of conditional shader code.
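For example, here is a sketch (with made-up names) of folding two materials into one shader via a uniform, trading a shader switch for a uniform branch:

precision mediump float;

varying vec2 v_texCoord;
uniform sampler2D u_texture;
uniform vec4 u_color;
uniform bool u_useTexture;   // made-up material flag: false = flat color, true = textured

void main()
{
    // A branch on a uniform is coherent across all fragments, which most
    // hardware handles cheaply, but it is still not free: balance it
    // against the cost of binding a second, simpler shader.
    if ( u_useTexture )
        gl_FragColor = texture2D( u_texture, v_texCoord ) * u_color;
    else
        gl_FragColor = u_color;
}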
