OpenGL ES 2.0: Model Matrix vs Per-Vertex Calculation - performance

I may be asking a silly question, but I'm a bit curious about OpenGL ES 2.0 performance.
Let's say I have a drawing object that contains a Vertex Array "VA", a Buffer Array "BA", and/or a Model Matrix "MM", and I want to do at least one translation and one rotation per frame. So, what is the best alternative?
(1) Do the operations (Rot and Trans) on VA and pass the result to BA.
(2) Do the operations (Rot and Trans) directly on BA.
(3) Do the operations on MM and pass it to the OpenGL vertex shader.
My concern is about performance, the processing/memory ratio. I think the 3rd option may be the best because of the GPU, but also the most expensive in terms of memory, because every object would have to have an MM, right?
Another solution I thought of was to pass the translation and rotation parameters to the shaders and assemble the MM in the shader.
How is this best done?

It is far from a silly question, but unfortunately it all depends on the case. In general, even keeping vertex buffers on the GPU might not be the best idea if the vertex data is constantly changing, but I guess that is not your case.
So the two main approaches you are considering are:
Modify each vertex on the CPU and then send the vertex data to the GPU.
Leave the data on the GPU as it is and transform it in the vertex shader using a matrix.
The first option is actually good if the vertex data changes beyond what you can express with a matrix or any other analytically defined vertex transformation, for instance if you keep generating random positions on the CPU. In such cases there is little sense in even using a vertex buffer, since you will need to keep streaming the vertex data every frame anyway.
The second one is great in cases where the base vertex data is relatively static (not changing much from frame to frame). You push the vertex data to the GPU once (or once every now and then) and then use the vertex shader to transform it for you. The vertex shader on the GPU is very effective at this and will be much faster than applying the same algorithm on the CPU.
So about your questions:
The third option would most likely be the best if you have a significant amount of vertex data, and I wouldn't call it expensive in terms of memory: a matrix consists of 16 floats, which is small, since six 3D vertex positions (18 floats) already take more memory than that. So you should not worry about that at all. If anything, you should worry about how much data you stream to the GPU, which again is the least with this option.
Passing a translation and a rotation to the vertex shader and then composing the matrix there is probably not the best idea. You gain a little in traffic to the GPU by sending 4+3 floats instead of 16, but you send them in two chunks, which can produce overhead. Beyond that, you consume more memory rather than less, since you need to create the matrix in the shader anyway. And you will be computing a new matrix on every vertex shader invocation, which means for each and every vertex.
As for these matrices and memory, it is hard to say whether they have any influence on memory at all. The stack size is usually fixed or at least rounded, so adding a matrix to the shader will most likely make no difference in memory consumption.
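To make the comparison concrete, here is a minimal sketch in plain Python (no OpenGL calls) of option 3: the model matrix is composed once per object on the CPU, and the only per-vertex work left is a single matrix-vector product, which is what the vertex shader would perform. The helper names (translation, rotation_z, mat_mul, transform, to_column_major) are made up for this illustration, and rotation about the Z axis is just an assumed example.

```python
import math

def translation(tx, ty, tz):
    # 4x4 translation matrix, stored as rows for readability.
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def rotation_z(angle):
    # 4x4 rotation about the Z axis (angle in radians).
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform(m, v):
    # The per-vertex work: one matrix * vector product (w assumed to be 1).
    p = (v[0], v[1], v[2], 1.0)
    return tuple(sum(m[i][k] * p[k] for k in range(4)) for i in range(3))

def to_column_major(m):
    # OpenGL expects column-major data; in ES 2.0 the transpose argument of
    # glUniformMatrix4fv must be GL_FALSE, so the flattening happens here.
    return [m[r][c] for c in range(4) for r in range(4)]

# Composed once per object per frame: rotate first, then translate.
model = mat_mul(translation(2.0, 0.0, 0.0), rotation_z(math.pi / 2))

vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
world = [transform(model, v) for v in vertices]
uniform_data = to_column_major(model)  # the 16 floats sent as the uniform
```

Composing the matrix inside the shader would repeat the mat_mul step for every vertex instead of once per object.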
When it comes to OpenGL and performance, you primarily need to watch for:
Memory consumption. This is mostly taken up by textures: a 1024x1024 RGBA texture takes about 4MB, which equals a million floats or about 350k vertices each containing a 3D position vector, so something like a matrix really has little effect.
Data stream. This is how much data you need to pass to the GPU on every frame for processing. This should be reduced as much as possible but again sending up to a few MB should not be a problem at all.
Overall efficiency in the shader
Number of draw calls. If possible try to pack as much similar data as possible to reduce the draw calls.
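The size figures quoted in this answer can be checked with a few lines of arithmetic. This sketch assumes 4-byte floats and one byte per texture channel; the variable names are only for illustration.

```python
FLOAT_BYTES = 4  # assumes 32-bit floats

texture_bytes = 1024 * 1024 * 4                       # RGBA, 1 byte per channel
floats_equiv = texture_bytes // FLOAT_BYTES           # floats that fit in the same space
vec3_vertices_equiv = texture_bytes // (3 * FLOAT_BYTES)  # 3D positions that fit

matrix_bytes = 16 * FLOAT_BYTES                       # one 4x4 model matrix
six_positions_bytes = 6 * 3 * FLOAT_BYTES             # six 3D vertex positions

print(texture_bytes, floats_equiv, vec3_vertices_equiv)
print(matrix_bytes, six_positions_bytes)
```

One texture of that size weighs as much as roughly 65,000 model matrices, which is why the per-object matrix is not the place to save memory.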

Related

Precalculating OpenGL model transformations for static world space

I'm working on an OpenGL visualisation for navigating a 3D dataset. Briefly, the visualisation takes in a large (~1 million data points) array of matrices, which are then eigendecomposed and visualised as ellipsoids.
I have found that performance improves significantly when I calculate ellipsoid vertex transformations "up-front" (i.e. calculate all model transformations once only on the CPU), rather than in shaders (where the model transformations have to be calculated for each draw). For scene navigation/lighting etc., view and projection transformations are calculated as normal and passed as uniforms to the relevant shaders.
The result of this approach is the program taking longer to initialise (due to the CPU being tied up calculating all the model transformations), but significantly higher frame rates.
I understand from this, that it is common to decompose matrices to avoid unnecessary shader computations, however I haven't come across anything describing this practice of completely pre-calculating the world space.
I understand that this approach is only appropriate for my narrow use case (i.e. where the scene is static, meaning a vertex's position in world space will never change while the program is running). Apart from that, are there any significant reasons that I should avoid doing this?
It's a common optimization to remove redundant transformations from static objects. Your objects are static in the world, so you've collapsed all the redundant transformations right up to the root of your scene, which is not a problem.
Having said that, the performance gain you're seeing is probably not coming from the cost of doing the model transform in the shader, but from passing that transform to the shader for each object. You have not said much about how you organize the ellipsoids, but if you are updating a program with the model matrix uniform and issuing a DrawElements call for each ellipsoid, that is very slow indeed. Even with something more exotic -- like using instances and passing each transform in a VBO -- you would still have the overhead of updating them, which you can now avoid. If you are not doing this already, you can group your ellipsoid vertices into large arrays and draw them with only a few DrawElements calls.
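A minimal sketch of the grouping step, in plain Python with no OpenGL calls: one shared mesh is pre-transformed for every object and merged into a single vertex list and a single index list, which could then be uploaded once and drawn with one call. The function name bake_batch is hypothetical, and a uniform scale plus offset stands in for the full model transform (a real ellipsoid would need a full matrix derived from the eigendecomposition).

```python
def bake_batch(mesh_vertices, mesh_indices, placements):
    """Pre-transform one shared mesh for every placement and merge the
    copies into a single vertex list and a single index list."""
    vertices, indices = [], []
    for scale, (ox, oy, oz) in placements:
        base = len(vertices)  # index offset of this copy inside the batch
        vertices.extend((x * scale + ox, y * scale + oy, z * scale + oz)
                        for x, y, z in mesh_vertices)
        indices.extend(base + i for i in mesh_indices)
    return vertices, indices

# Two copies of a single triangle, the second one scaled and moved.
tri_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tri_indices = [0, 1, 2]
batch_vertices, batch_indices = bake_batch(
    tri_vertices, tri_indices,
    [(1.0, (0.0, 0.0, 0.0)), (2.0, (10.0, 0.0, 0.0))])
```

The cost moves to start-up time and buffer size, which matches the longer initialisation and higher frame rate described in the question.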

Optimized GPU to CPU data transfer

I'm a bit out of my depth here (best way to be me thinks), but I am poking around looking for an optimization that could reduce GPU to CPU data transfer for my application.
I have an application that performs some modifications to vertex data in the GPU. Occasionally the CPU has to read back parts of the modified vertex data and then compute some parameters which then get passed back into the GPU shader via uniforms, forming a loop.
It takes too long to transfer all the vertex data back to the CPU and then sift through it on the CPU (millions of points), and so I have a "hack" in place to reduce the workload to usable, although not optimal.
What I do:
CPU: read image
CPU: generate 1 vertex per pixel, Z based on colour information/filter etc
CPU: transfer all vertex data to GPU
GPU: transform feedback used to update GL_POINT vertex coords in realtime based on some uniform parameters set from the CPU.
When I wish to read only a rectangular "section", I use glMapBufferRange to map the entire rows that comprise the desired rect (bad diagram alert):
[diagram not preserved: the desired rectangle is shown in red, the remainder of the rows it spans in blue]
This is supposed to represent the image/set of vertices in the GPU. My "hack" involves having to read all the blue and red vertices. This is because I can only specify 1 continuous range of data to read back.
Does anyone know a clever way to efficiently get at the red, without the blue? (without having to issue a series of glMapBufferRange calls)
EDIT-
The use case is that I render the image into a 3D world as GLPoints, coloured and offset in Z by an amount based on the colour info (sized etc. according to distance). Then the user can modify the vertex Z data with a mouse cursor brush. The logic behind some of the brush application code needs to know the Zs of the area under the mouse (brush circle), e.g. min/max/average, so that the CPU can control the shader's modification of the data by setting a series of uniforms that feed into the shader. So, for example, the user can say: I want all points under the cursor set to the average value. This could all probably be done entirely on the GPU, but the idea is that once I get the CPU-GPU "loop" optimised as far as I reasonably can, I can then expand the min/max/avg stuff to do interesting things on the CPU that would (probably) be cumbersome to do entirely on the GPU.
Cheers!
Laythe
To get any data from the GPU to the CPU you need to map the GPU memory in any case, which means the OpenGL application will have to use something like mmap under the hood. I have checked the implementation of that for both x86 and ARM, and it looks like it is page-aligned, so you cannot map less than 1 contiguous page of GPU memory at any given time, so even if you could request to map just the red areas, you quite likely would also get the blue ones as well (depending on your page and pixel data sizes).
Solution 1
Just use glReadPixels, as this allows you to select a window of the framebuffer. I assume a GPU vendor like Intel would optimize the driver, so it would map as few pages as possible, however this is not guaranteed, and in some cases you may need to map 2 pages just for 2 pixels.
Solution 2
Create a compute shader or use several glCopyBufferSubData calls to copy your region of interest into a contiguous buffer in GPU memory. If you know the height and width you want, you can then un-mangle and get a 2D buffer back on the CPU side.
Which of the above solutions works better depends on your hardware and driver implementation. If GPU->CPU is the bottleneck and GPU->GPU is fast, then the second solution may work well, however you would have to experiment.
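For the glCopyBufferSubData variant of Solution 2, the copy ranges could be computed along these lines. This is a plain Python sketch that only produces the (read offset, write offset, size) byte triples, one per row of the rectangle; it assumes the vertices are stored row by row with a fixed size per vertex, and the function name rect_copy_ranges is made up for the example.

```python
def rect_copy_ranges(image_width, rect_x, rect_y, rect_w, rect_h, vertex_bytes):
    """Return one (read_offset, write_offset, size) triple in bytes per row of
    the rectangle, packing the rows back to back in the destination buffer."""
    ranges = []
    row_bytes = rect_w * vertex_bytes
    for row in range(rect_h):
        read_offset = ((rect_y + row) * image_width + rect_x) * vertex_bytes
        write_offset = row * row_bytes
        ranges.append((read_offset, write_offset, row_bytes))
    return ranges

# 8 vertices per row, 12 bytes per vertex (assumed: three floats),
# rectangle of 3 x 2 vertices starting at column 2, row 1.
ranges = rect_copy_ranges(8, 2, 1, 3, 2, 12)
packed_bytes = sum(size for _, _, size in ranges)   # what gets mapped afterwards
full_rows_bytes = 2 * 8 * 12                        # what the full-row "hack" maps
```

Each triple corresponds to the readOffset, writeOffset and size arguments of one glCopyBufferSubData call; after the copies, a single glMapBufferRange over the packed buffer returns only the rectangle.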
Solution 3
As suggested in the comments, do everything on the GPU. This depends heavily on whether the work parallelizes well, but if copying the memory is too slow for you, then you don't have much other choice.
I suppose you are asking because you cannot do all the work in shaders, right?
If you render to a Frame Buffer Object, then bind it as GL_READ_FRAMEBUFFER, you can read a block of it by glReadPixels.

Efficiently rendering a transparent terrain in OpenGL

I'm writing an OpenGL program that visualizes caves, so when I visualize the surface terrain I'd like to make it transparent, so you can see the caves below. I'm assuming I can normalize the data from a Digital Elevation Model into a grid aligned to the X/Z axes with regular spacing, and render each grid cell as two triangles. With an aligned grid I could avoid the cost of sorting when applying the painter's algorithm (to ensure proper transparency effects); instead I could just render the cells row by row, starting with the farthest row and the farthest cell of each row.
That's all well and good, but my question for OpenGL experts is, how could I draw the terrain most efficiently (and in a way that could scale to high resolution terrains) using OpenGL? There must be a better way than calling glDrawElements() once for every grid cell. Here are some ways I'm thinking about doing it (they involve features I haven't tried yet, that's why I'm asking the experts):
glMultiDrawElements Idea
Put all the terrain coordinates in a vertex buffer
Put all the coordinate indices in an element buffer
To draw, write the starting indices of each cell into an array in the desired order and call glMultiDrawElements with that array.
This seems pretty good, but I was wondering if there was any way I could avoid transferring an array of indices to the graphics card every frame, so I came up with the following idea:
Uniform Buffer Idea
This seems like a backward way of using OpenGL, but just putting it out there...
Put the terrain coordinates in a 2D array in a uniform buffer
Put coordinate index offsets 0..5 in a vertex buffer (they would have to be floats, I know)
Call glDrawArraysInstanced - each instance will be one grid cell
The vertex shader examines the position of the camera relative to the terrain and determines how to order the cells, mapping gl_InstanceID to the index of the first coordinate of the cell in the uniform buffer, and setting gl_Position to the coordinate at this index + the index offset attribute
I figure there might be shiny new OpenGL 4.0 features I'm not aware of that would be more elegant than either of these approaches. I'd appreciate any tips!
The glMultiDrawElements() approach sounds very reasonable. I would implement that first, and use it as a baseline you can compare to if you try more complex approaches.
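As a baseline, the per-frame array for glMultiDrawElements could be built roughly like this. The sketch is plain Python and assumes the element buffer stores 6 indices per cell, cell by cell in row-major order, as 4-byte GL_UNSIGNED_INT values; the function names are hypothetical. It implements the ordering proposed in the question (farthest row first, farthest cell first within each row), measured from the grid cell the camera is over.

```python
INDICES_PER_CELL = 6   # two triangles per grid cell
INDEX_BYTES = 4        # assumes GL_UNSIGNED_INT indices

def far_to_near(count, cam):
    # Positions 0..count-1 ordered from farthest to nearest to the camera position.
    return sorted(range(count), key=lambda i: -abs(i - cam))

def cell_draw_order(rows, cols, cam_row, cam_col):
    """Byte offsets into the element buffer, one per cell, back to front."""
    offsets = []
    for r in far_to_near(rows, cam_row):
        for c in far_to_near(cols, cam_col):
            cell = r * cols + c
            offsets.append(cell * INDICES_PER_CELL * INDEX_BYTES)
    return offsets

# 2 x 2 grid with the camera over cell (0, 0): the far corner is drawn first.
offsets = cell_draw_order(2, 2, 0, 0)
counts = [INDICES_PER_CELL] * len(offsets)
```

The offsets and counts lists correspond to the indices and count arrays passed to glMultiDrawElements. Only two small one-dimensional sorts are needed per frame, not a sort over all cells.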
Whether you have a chance to make it faster depends on whether the processing of draw calls is an important bottleneck in your rendering. Unless the triangles you render are very small, and/or your fragment shader is very simple, there's a good chance that you will be limited by fragment processing anyway. If you have profiling tools that allow you to collect data and identify bottlenecks, you can be much more targeted in your optimization efforts. Of course there is always the low-tech approach: if making the window smaller improves your performance, chances are that you're mostly fragment limited.
Back to your question: since you asked about shiny new GL4 features, another method you could check out is indirect rendering, using glDrawElementsIndirect(). Beyond being more flexible, the main difference from glMultiDrawElements() is that the parameters used for each draw, like the start index in your case, can be sourced from a buffer. This might save one copy if you map this buffer and write the start indices directly into it. You could even combine it with persistent buffer mapping (look up GL_MAP_PERSISTENT_BIT) so that you don't have to map and unmap the buffer each time.
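For reference, the data written into such a buffer is an array of DrawElementsIndirectCommand records: five 32-bit fields (count, instanceCount, firstIndex, baseVertex, baseInstance). The sketch below packs one record per grid cell using Python's struct module; the function name pack_cell_commands and the 6-indices-per-cell layout are assumptions carried over from the question. Note that glDrawElementsIndirect consumes a single record, so drawing many records in one call would use glMultiDrawElementsIndirect.

```python
import struct

# count, instanceCount, firstIndex, baseVertex (signed), baseInstance
COMMAND = struct.Struct("=IIIiI")

def pack_cell_commands(cell_order, indices_per_cell=6):
    """Pack one indirect draw record per cell, in the given draw order.
    firstIndex is expressed in indices, not bytes."""
    buf = bytearray()
    for cell in cell_order:
        buf += COMMAND.pack(indices_per_cell, 1, cell * indices_per_cell, 0, 0)
    return bytes(buf)

# Draw cell 3 first, then cell 0.
commands = pack_cell_commands([3, 0])
first = COMMAND.unpack_from(commands, 0)
```

With a persistently mapped buffer, these bytes would be written straight into the mapped memory each frame.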
Your uniform buffer idea sounds pretty interesting. I'm slightly skeptical that it will perform better, but that's just a feeling, and not based on any data or direct experience. So I think you absolutely should try it, and report back on how well it works!
Stretching the scope of your question some more, you could also look into approaches for order-independent transparency rendering if you haven't considered and rejected them already. For example alpha-to-coverage is very easy to implement, and almost free if you would be using MSAA anyway. It doesn't produce very high quality transparency effects based on my limited attempts, but it could be very attractive if it does the job for your use case. Another technique for order-independent transparency is depth peeling.
If some self promotion is acceptable, I wrote an overview of some transparency rendering methods in an earlier answer here: OpenGL ES2 Alpha test problems.

OpenGL Optimization - Duplicate Vertex Stream or Call glDrawElements Repeatedly?

This is for an OpenGL ES 2.0 game on Android, though I suspect the right answer is generic to any opengl situation.
TL;DR - is it better to send N data to the gpu once and then make K draw calls with it; or send K*N data to the gpu once, and make 1 draw call?
More Details I'm wondering about best practices for my situation. I have a dynamic mesh whose vertices I recompute every frame - think of it as a water surface - and I need to project these vertices onto K different quads in my game. (In each case the projection is slightly different; sparing details, you could imagine them as K different mirrors surrounding the mesh.) K is in the order of 10-25; I'm still figuring it out.
I can think of two broad options:
(1) Bind the mesh as is, and call draw K different times, either changing a uniform for shaders or messing with the fixed function state to render to the correct quad in place (on the screen) or to different segments of a texture (which I can later use when rendering the quads to achieve the same effect).
(2) Duplicate all the vertices in the mesh K times, essentially making a single vertex stream with K meshes in it, and add an attribute (or few) indicating which quad each mesh clone is supposed to project onto (and how to get there), and use vertex shaders to project. I would make one call to draw, but send K times as much data.
The Question: of those two options, which is generally better performance wise?
(Additionally: is there a better way to do this?
I had considered a third option, where I rendered the mesh details to a texture, and created my K-clone geometry as a sort of dummy stream, which I could bind once and for all, that looked up in a vertex shader into the texture for each vertex to find out what vertex it really represented; but I've been told that texture support in vertex shaders is poor or prohibited in OpenGL ES 2.0 and would prefer to avoid that route.)
There is no perfect answer to this question, though I would suggest you think about the nature of real-time computer graphics and the OpenGL pipeline. Although "the GL" is required to produce results that are consistent with in-order execution, the reality is that GPUs are highly parallel beasts. They employ lots of tricks that work best if you actually have many unrelated tasks going on at the same time (some even split the whole pipeline up into discrete tiles). GDDR memory, for instance, is really high latency, so for efficiency GPUs need to be able to schedule other jobs to keep the stream processors (shader units) busy while memory is fetched for a job that is just starting.
If you are recomputing parts of your mesh each frame, then you will almost certainly want to favor more draw calls over massive CPU->GPU data transfers every frame. Saturating the bus with unnecessary data transfers plagues even PCI Express hardware (it is far slower than the overhead that several additional draw calls would ever add), and it can only get worse on embedded OpenGL ES systems. Having said that, there is no reason you could not simply do glBufferSubData (...) to stream in only the affected portions of your mesh and continue to draw the entire mesh in a single draw call.
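A rough back-of-the-envelope comparison of the per-frame upload for the two options in the question, as a plain Python sketch. The numbers (10,000 vertices, 12 bytes per vertex, K = 20, one 64-byte matrix uniform per draw) are assumptions chosen for illustration, and the per-clone "which quad" attribute of option 2 is left out because it is static and could be uploaded once.

```python
def upload_bytes_multi_draw(n_vertices, vertex_bytes, k, uniform_bytes):
    # Option 1: the mesh is uploaded once per frame, plus one small uniform per draw.
    return n_vertices * vertex_bytes + k * uniform_bytes

def upload_bytes_duplicated(n_vertices, vertex_bytes, k):
    # Option 2: K copies of the recomputed mesh are uploaded every frame.
    return k * n_vertices * vertex_bytes

multi_draw = upload_bytes_multi_draw(10_000, 12, 20, 64)
duplicated = upload_bytes_duplicated(10_000, 12, 20)
print(multi_draw, duplicated)
```

Under these assumptions the duplicated stream moves about twenty times more data per frame, which is the cost being weighed against the extra draw calls. Only profiling on the target hardware settles which side wins.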
You might get better cache coherency if you split (or partition the data within) the buffer and/or draw calls up, depending on your actual use case scenario. The only way to decisively tell which is going to work better in your case is to profile your software on your target hardware. But all of this fails to look at the bigger picture, which is: "Why am I doing this on the CPU?!"
It sounds like what you really want is simply vertex instancing. If you can re-work your algorithm to work completely in vertex shaders by passing instance IDs you should see a massive improvement over all of the solutions I have seen you propose so far (true instancing is actually somewhere between what you described in solutions 1 and 2) :)
The actual concept of instancing is very simple and will give you benefits whether your particular version of the OpenGL API supports it at the API level or not (you can always implement it manually with vertex attributes and extra vertex buffer data). The thing is, you would not have to duplicate your data at all if you implement instancing correctly. The extra data necessary to identify each individual vertex is static, and you can always change a shader uniform and make an additional draw call (this is probably what you will have to do with OpenGL ES 2.0, since it does not offer glDrawElementsInstanced) without touching any vertex data.
You certainly will not have to duplicate your vertices K*N times, your buffer space complexity would be more like O (K + K*M), where M is the number of new components you had to add to uniquely identify each vertex so that you could calculate everything on the GPU. For "instance," you might need to number each of the vertices in your quad 1-4 and process the vertex differently in your shader depending on which vertex you're processing. In this case, the M coefficient is 1 and it does not change no matter how many instances of your quad you need to dynamically calculate each frame; N would determine the number of draw calls in OpenGL ES 2.0, not the size of your data. None of this additional storage space would be necessary in OpenGL ES 2.0 if it supported gl_VertexID :(
Instancing is the best way to make effective use of the highly-parallel GPU and avoid CPU/GPU synchronization and slow bus transfers. Even though OpenGL ES 2.0 does not support instancing in the API sense, multiple draw calls using the same vertex buffer where the only thing you change between calls are a couple of shader uniforms is often preferable to computing your vertices on the CPU and uploading new vertex data every frame or having your vertex buffer's size depend directly on the number of instances you intend to draw (yuck). You'll have to try it out and see what your hardware likes.
Instancing would be what you are looking for, but unfortunately it is not available in OpenGL ES 2.0. I would be in favor of sending all the vertices to the GPU and making one draw call, if all your assets can fit into GPU memory. In my experience, reducing draw calls from 100+ to 1 took performance from 15 fps to 60 fps.

Rotate only part of the vertices?

I'm adding an OpenGL renderer to my 2D game engine and I want to know whether there is a way to apply an mvp matrix only to part of the vertices in a single draw call?
I'm planning to group draw calls by textures so I'll pass a buffer of many vertices and texcoords, now I want to apply different rotation angles to different quads. Is there a way to accomplish it in the shader or should I give up on the mvp matrix in the shader and perform the same thing using the cpu?
EDIT: What about adding 3 float attributes (rotation and rot_center.xy) per vertex?
Which gives better performance:
(1) doing CPU rotation?
(2) providing 3 more floats per vertex
(3) separating draw calls?
Is there any other option?
Here is a possibility:
(1) Do the rotation in the vertex shader. Pass in the information (angle?) needed to create the rotation matrix as a vertex attribute.
(2) Pass in a vertex attribute (ubyte) that is effectively a per-vertex boolean flag. The rotation in (1) will be executed only if the flag is set.
Not sure if the above will work for you from a performance/storage perspective.
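The per-vertex math the shader would run under this scheme is small. The sketch below is plain Python standing in for the vertex shader body; it combines the three attributes from the question's EDIT (rotation angle and rot_center.xy) with the boolean flag from this answer. The function name rotate_vertex and the argument order are made up for the example.

```python
import math

def rotate_vertex(x, y, angle, cx, cy, flag=1):
    """Rotate (x, y) by angle (radians) about (cx, cy) when flag is set;
    otherwise pass the position through unchanged."""
    if not flag:
        return (x, y)
    c, s = math.cos(angle), math.sin(angle)
    dx, dy = x - cx, y - cy
    return (cx + dx * c - dy * s, cy + dx * s + dy * c)

# A quad corner rotated a quarter turn about the quad centre (1, 1).
rotated = rotate_vertex(2.0, 1.0, math.pi / 2, 1.0, 1.0)
untouched = rotate_vertex(2.0, 1.0, math.pi / 2, 1.0, 1.0, flag=0)
```

All four vertices of a quad carry the same angle and centre, so the extra attributes are duplicated per vertex; that duplication is the bandwidth cost to weigh against separate draw calls.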
I think that, while it is a good thing to group draw calls for many different performance reasons, changing your code to satisfy a basic requirement such as rotation is not a good idea.
Draw batching is a good thing, but if you are forced to keep an additional attribute (because you certainly cannot do it with uniforms; you would not have the per-entity information), it is not worth it.
An additional attribute means much more memory bandwidth usage, which is usually the main performance killer on today's systems.
Draw batching, on the other hand, is important but not always critical; it depends on many factors, such as:
The GPU OpenGL driver optimization
The GPU tile configuration
The number of shapes/draw calls we are talking about (if you have 20 quads on the screen, why should you bother with batching? :) )
In other words, it is often much more convenient to drop extreme batching in favor of simplicity/maintainability and avoid fancy solutions for simple requirements such as rotation.
I hope this helps in some way.
Use two different objects, that is all!
There is no other workaround for rotating only part of an object.
Example:
A game with a tank, where you want to rotate the turret and the rest of the body separately. As in your case, the two are treated as separate objects.

Resources