Efficient way of drawing in OpenGL ES - opengl-es

In my application I draw a lot of cubes through OpenGL ES Api. All the cubes are of same dimensions, only they are located at different coordinates in space. I can think of two ways of drawing them, but I am not sure which is the most efficient one. I am no OpenGL expert, so I decided to ask here.
Method 1, which is what I use now: Since all the cubes are of identical dimensions, I calculate vertex buffer, index buffer, normal buffer and color buffer only once. During a refresh of the scene, I go over all cubes, do bufferData() for same set of buffers and then draw the triangle mesh of the cube using drawElements() call. Since each cube is at different position, I translate the mvMatrix before I draw. bufferData() and drawElements() is executed for each cube. In this method, I probably save a lot of memory, by not calculating the buffers every time. But I am making lot of drawElements() calls.
Method 2 would be: Treat all cubes as set of polygons spread all over the scene. Calculate vertex, index, color, normal buffers for each polygon (actually triangles within the polygons) and push them to graphics card memory in single call to bufferData(). Then draw them with single call to drawElements(). The advantage of this approach is, I do only one bindBuffer and drawElements call. The downside is, I use lot of memory to create the buffers.
My experience with OpenGL is limited enough, to not know which one of the above methods is better from performance point of view.
I am using this in a WebGL app, but it's a generic OpenGL ES question.

I implemented method 2 and it wins by a landslide. The supposed downside of high amount of memory seemed to be only my imagination. In fact the garbage collector got invoked in method 2 only once, while it was invoked for 4-5 times in method 1.
Your OpenGL scenario might be different from mine, but if you reached here in search of performance tips, the lesson from this question is: Identify the parts in your scene that don't change frequently. No matter how big they are, put them in single buffer set (VBOs) and upload to graphics memory minimum number of times. That's how VBOs are meant to be used. The memory bandwidth between client (i.e. your app) and graphics card is precious and you don't want to consume it often without reason.
Read the section "Vertex Buffer Objects" in Ch. 6 of "OpenGL ES 2.0 Programming Guide" to understand how they are supposed to be used. http://opengles-book.com/

I know that this question is already answered, but I think it's worth pointing out the Google IO presentation about WebGL optimization:
http://www.youtube.com/watch?v=rfQ8rKGTVlg
They cover, essentially, this exact same issue (lot's of identical shapes with different colors/positions) and talk about some great ways to optimize such a scene (and theirs is dynamic too!)

I propose following approach:
On load:
Generate coordinates buffer (for one cube) and load it into VBO (gl.glGenBuffers, gl.glBindBuffer)
On draw:
Bind buffer (gl.glBindBuffer)
Draw each cell (loop)
2.1. Move current position to center of current cube (gl.glTranslatef(position.x, position.y, position.z)
2.2. Draw current cube (gl.glDrawArrays)
2.3. Move position back (gl.glTranslatef(-position.x, -position.y, -position.z))

Related

Efficiently rendering a transparent terrain in OpenGL

I'm writing an OpenGL program that visualizes caves, so when I visualize the surface terrain I'd like to make it transparent, so you can see the caves below. I'm assuming I can normalize the data from a Digital Elevation Model into a grid aligned to the X/Z axes with regular spacing, and render each grid cell as two triangles. With an aligned grid I could avoid the cost of sorting when applying the painter's algorithm (to ensure proper transparency effects); instead I could just render the cells row by row, starting with the farthest row and the farthest cell of each row.
That's all well and good, but my question for OpenGL experts is, how could I draw the terrain most efficiently (and in a way that could scale to high resolution terrains) using OpenGL? There must be a better way than calling glDrawElements() once for every grid cell. Here are some ways I'm thinking about doing it (they involve features I haven't tried yet, that's why I'm asking the experts):
glMultiDrawElements Idea
Put all the terrain coordinates in a vertex buffer
Put all the coordinate indices in an element buffer
To draw, write the starting indices of each cell into an array in the desired order and call glMultiDrawElements with that array.
This seems pretty good, but I was wondering if there was any way I could avoid transferring an array of indices to the graphics card every frame, so I came up with the following idea:
Uniform Buffer Idea
This seems like a backward way of using OpenGL, but just putting it out there...
Put the terrain coordinates in a 2D array in a uniform buffer
Put coordinate index offsets 0..5 in a vertex buffer (they would have to be floats, I know)
call glDrawArraysInstanced - each instance will be one grid cell
the vertex shader examines the position of the camera relative to the terrain and determines how to order the cells, mapping gl_instanceId to the index of the first coordinate of the cell in the Uniform Buffer, and setting gl_Position to the coordinate at this index + the index offset attribute
I figure there might be shiny new OpenGL 4.0 features I'm not aware of that would be more elegant than either of these approaches. I'd appreciate any tips!
The glMultiDrawElements() approach sounds very reasonable. I would implement that first, and use it as a baseline you can compare to if you try more complex approaches.
If you have a chance to make it faster will depend on whether the processing of draw calls is an important bottleneck in your rendering. Unless the triangles you render are very small, and/or your fragment shader very simple, there's a good chance that you will be limited by fragment processing anyway. If you have profiling tools that allow you to collect data and identify bottlenecks, you can be much more targeted in your optimization efforts. Of course there is always the low-tech approach: If making the window smaller improves your performance, chances are that you're mostly fragment limited.
Back to your question: Since you asked about shiny new GL4 features, another method you could check out is indirect rendering, using glDrawElementsIndirect(). Beyond being more flexible, the main difference to glMultiDrawElements() is that the parameters used for each draw, like the start index in your case, can be sourced from a buffer. This might prevent one copy if you map this buffer, and write the start indices directly to the buffer. You could even combine it with persistent buffer mapping (look up GL_MAP_PERSISTENT_BIT) so that you don't have to map and unmap the buffer each time.
Your uniform buffer idea sounds pretty interesting. I'm slightly skeptical that it will perform better, but that's just a feeling, and not based on any data or direct experience. So I think you absolutely should try it, and report back on how well it works!
Stretching the scope of your question some more, you could also look into approaches for order-independent transparency rendering if you haven't considered and rejected them already. For example alpha-to-coverage is very easy to implement, and almost free if you would be using MSAA anyway. It doesn't produce very high quality transparency effects based on my limited attempts, but it could be very attractive if it does the job for your use case. Another technique for order-independent transparency is depth peeling.
If some self promotion is acceptable, I wrote an overview of some transparency rendering methods in an earlier answer here: OpenGL ES2 Alpha test problems.

What are the non-trivial use-cases of attributes in WebGL/OpenGL in general?

I'm making a WebGL game and eventually came up with a pretty convenient concept of object templates, when the game objects of the same kind (say, characters of the same race) are using the same template (which means: buffers, attributes and shader program), and are instanced from that template by specifying a set of uniforms (which are, in fact, the most common difference between the same-kind objects: model matrix, textures, bones positions, etc). For making independent objects with their own deep-copy of buffers, I just deep-copy and re-initialize the original template and start instantiating new objects from it.
But after that I started having doubts. Say, if I start using morphing on objects, by explicit editing of the vertices, this approach will require me to make a separate template for every object of such kind (otherwise, they would start morphing in exactly the same phase). Which is probably fine for this very case, 'cause I'll most likely need to recalculate normals and even texture coordinates, which means – most of the buffers.
But what if I'm missing some very common case of using attributes, say, blood decals, which will require me to update only a small piece of the buffer? In that case, it would be much more reasonable to have two buffers for each object: a common one that is shared by them all and the one for blood decals, which is unique for every single of them. And, as blood is usually spilled on everything, this sounds pretty reasonable, so that we would save a lot of space by storing vertices, normals and such without their unnecessary duplication.
I haven't tried implementing decals yet, so honestly not even sure if implementing them using vertex painting (textured or not) is the right choice. But I'm also pretty sure there are some commonly used attributes aside from vertices, normals and texture coordinates.
Here are some that I managed to come up with myself:
decals (probably better to be modelled as separate objects?)
bullet holes and such (same as decals maybe?)
Any thoughts?
UPD: as all this might sound confusing, I want to clarify: I do understand that using as few buffers as possible is a good thing, this is exactly why I'm trying to use this templates concept. My question is: what are the possible cases when using a single buffer and a single element buffer (with both of them shared between similar objects) for a template is going to stab me in the back?
Keeping a giant chunk of data that won't change on the card is incredibly useful for saving bandwidth. Additionally, you probably won't be directly changing the vertices positions once they are on the card. Instead you will probably morph them with passed in uniforms in the Vertex shader through Skeletal animation. Read about it here: Skeletal Animation
Do keep in mind though, that in Key frame animation with meshes, you would keep a bunch of buffers on the card each in a different key frame pose of the animation. However, you would then load whatever two key frames you want to interpolate over in as attributes and then blend between them (You can have more than two). Keyframe Animation
Additionally, with the introduction of Transformation Feedback, (No you don't get to use it in WebGL, it became core in OpenGL 3.0, WebGL is based on OpenGL ES 2.0, which is based on OpenGL 2.0) you can start keeping calculated data GPU side. In other words, you can do a giant particle system simulation in the vertex or geometry shader and then store the calculated data into another buffer, then use that buffer in the next frame without having to have a round trip from the GPU to CPU Read about them here: Transform Feedback and here: Transform Feedback how to
In general, you don't want to touch buffers once they are on the card, especially every frame. Instead load several and use pointers to that data in shaders as attributes.

OpenGL Optimization - Duplicate Vertex Stream or Call glDrawElements Repeatedly?

This is for an OpenGL ES 2.0 game on Android, though I suspect the right answer is generic to any opengl situation.
TL;DR - is it better to send N data to the gpu once and then make K draw calls with it; or send K*N data to the gpu once, and make 1 draw call?
More Details I'm wondering about best practices for my situation. I have a dynamic mesh whose vertices I recompute every frame - think of it as a water surface - and I need to project these vertices onto K different quads in my game. (In each case the projection is slightly different; sparing details, you could imagine them as K different mirrors surrounding the mesh.) K is in the order of 10-25; I'm still figuring it out.
I can think of two broad options:
Bind the mesh as is, and call draw K different times, either
changing a uniform for shaders or messing with the fixed function
state to render to the correct quad in place (on the screen) or to different
segments of a texture (which I can later use when rendering the quads to achieve
the same effect).
Duplicate all the vertices in the mesh K times, essentially making a
single vertex stream with K meshes in it, and add an attribute (or
few) indicating which quad each mesh clone is supposed to project
onto (and how to get there), and use vertex shaders to project. I
would make one call to draw, but send K times as much data.
The Question: of those two options, which is generally better performance wise?
(Additionally: is there a better way to do this?
I had considered a third option, where I rendered the mesh details to a texture, and created my K-clone geometry as a sort of dummy stream, which I could bind once and for all, that looked up in a vertex shader into the texture for each vertex to find out what vertex it really represented; but I've been told that texture support in vertex shaders is poor or prohibited in OpenGL ES 2.0 and would prefer to avoid that route.)
There is no perfect answer to this question, though I would suggest you think about the nature of real-time computer graphics and the OpenGL pipeline. Although "the GL" is required to produce results that are consistent with in-order execution, the reality is that GPUs are highly parallel beasts. They employ lots of tricks that work best if you actually have many unrelated tasks going on at the same time (some even split the whole pipeline up into discrete tiles). GDDR memory, for instance is really high latency, so for efficiency GPUs need to be able to schedule other jobs to keep the stream processors (shader units) busy while memory is fetched for a job that is just starting.
If you are recomputing parts of your mesh each frame, then you will almost certainly want to favor more draw calls over massive CPU->GPU data transfers every frame. Saturating the bus with unnecessary data transfers plagues even PCI Express hardware (it is far slower than the overhead that several additional draw calls would ever add), it can only get worse on embedded OpenGL ES systems. Having said that, there is no reason you could not simply do glBufferSubData (...) to stream in only the affected portions of your mesh and continue to draw the entire mesh in a single draw call.
You might get better cache coherency if you split (or partition the data within) the buffer and/or draw calls up, depending on your actual use case scenario. The only way to decisively tell which is going to work better in your case is to profile your software on your target hardware. But all of this fail to look at the bigger picture, which is: "Why am I doing this on the CPU?!"
It sounds like what you really want is simply vertex instancing. If you can re-work your algorithm to work completely in vertex shaders by passing instance IDs you should see a massive improvement over all of the solutions I have seen you propose so far (true instancing is actually somewhere between what you described in solutions 1 and 2) :)
The actual concept of instancing is very simple and will give you benefits whether your particular version of the OpenGL API supports it at the API level or not (you can always implement it manually with vertex attributes and extra vertex buffer data). The thing is, you would not have to duplicate your data at all if you implement instancing correctly. The extra data necessary to identify each individual vertex is static, and you can always change a shader uniform and make an additional draw call (this is probably what you will have to do with OpenGL ES 2.0, since it does not offer glDrawElementsInstanced) without touching any vertex data.
You certainly will not have to duplicate your vertices K*N times, your buffer space complexity would be more like O (K + K*M), where M is the number of new components you had to add to uniquely identify each vertex so that you could calculate everything on the GPU. For "instance," you might need to number each of the vertices in your quad 1-4 and process the vertex differently in your shader depending on which vertex you're processing. In this case, the M coefficient is 1 and it does not change no matter how many instances of your quad you need to dynamically calculate each frame; N would determine the number of draw calls in OpenGL ES 2.0, not the size of your data. None of this additional storage space would be necessary in OpenGL ES 2.0 if it supported gl_VertexID :(
Instancing is the best way to make effective use of the highly-parallel GPU and avoid CPU/GPU synchronization and slow bus transfers. Even though OpenGL ES 2.0 does not support instancing in the API sense, multiple draw calls using the same vertex buffer where the only thing you change between calls are a couple of shader uniforms is often preferable to computing your vertices on the CPU and uploading new vertex data every frame or having your vertex buffer's size depend directly on the number of instances you intend to draw (yuck). You'll have to try it out and see what your hardware likes.
Instancing would be what you are looking for but unfortunately it is not available with OpenGL ES 2.0. I would be in favor of sending all the vertices to the GPU and make one draw call if all your assets can fit into the GPU. I have an experience of reducing draw calls from 100+ to 1 and the performance went from 15 fps to 60 fps.

OpenGL tile rendering: most efficient way?

I am creating a tile-based 2D game as a way of learning basic "modern" OpenGL concepts. I'm using shaders with OpenGL 2.1., and am familiar with the rendering pipeline and how to actually draw geometry on-screen. What I'm wondering is the best way to organize a tilemap to render quickly and efficiently. I have thought of several potential methods:
1.) Store the quad representing a single tile (vertices and texture coordinates) in a VBO and render each tile with a separate draw* call, translating it to the correct position onscreen and using uniform2i to give the location in the texture atlas for that particular tile;
2.) Keep a VBO containing every tile onscreen (already-computed screen coordinates and texture atlas coordinates), using BufferSubData to update the tiles every frame but using a single draw* call;
3.) Keep VBOs containing static NxN "chunks" of tiles, drawing however many chunks of tiles are at least partially visible onscreen and translating them each into position.
*I'd like to stay away from the last option if possible unless rendering chunks of 64x64 is not too inefficient. Tiles are loaded into memory in blocks of that size, and even though only about 20x40 tiles are visible onscreen at a time, I would have to render up to four chunks at once. This method would also complicate my code in several other ways.
So, which of these is the most efficient way to render a screen of tiles? Are there any better methods?
You could do any one of these and they would probably be fine; what you're proposing to render is very, very simple.
#1 will definitely be worse in principle than the other options, because you would be drawing many extremely simple “models” rather than letting the GPU do a whole lot of batch work on one draw call. However, if you have only 20×40 = 800 tiles visible on screen at once, then this is a trivial amount of work for any modern CPU and GPU (unless you're doing some crazy fragment shader).
I recommend you go with whichever is simplest to program for you, so that you can continue work on your game. I imagine this would be #1, or possibly #2. If and when you find yourself with a performance problem, do whichever of #2 or #3 (64×64 sounds like a fine chunk size) lets you spend the least CPU time on your program's part of drawing (i.e. updating the buffer(s)).
I've been recently learning modern OpenGL myself, through OpenGL ES 2.0 on Android. The OpenGL ES 2.0 Programming Guide recommends an "array of structures", that is,
"Store vertex attributes together in a single buffer. The structure represents all attributes of a vertex and we have an array of these attributes per vertex."
While this may seem like it would initially consume a lot of space, it allows for efficient rendering using VBOs and flexibility in texture mapping each tile. I recently did a tiled hex grid using interleaved arrays containing vertex, normals, color, and texture data for a 20x20 tile hex grid on a Droid 2. So far things are running smoothly.

OpenGL drawing the same polygon many times in multiple places

In my opengl app, I am drawing the same polygon approximately 50k times but at different points on the screen. In my current approach, I do the following:
Draw the polygon once into a display list
for each instance of the polygon, push the matrix, translate to that point, scale and rotate appropriate (the scaling of each point will be the same, the translation and rotation will not).
However, with 50k polygons, this is 50k push and pops and computations of the correct matrix translations to move to the correct point.
A coworker of mine also suggested drawing the entire scene into a buffer and then just drawing the whole buffer with a single translation. The tradeoff here is that we need to keep all of the polygon vertices in memory rather than just the display list, but we wouldn't need to do a push/translate/scale/rotate/pop for each vertex.
The first approach is the one we currently have implemented, and I would prefer to see if we can improve that since it would require major changes to do it the second way (however, if the second way is much faster, we can always do the rewrite).
Are all of these push/pops necessary? Is there a faster way to do this? And should I be concerned that this many push/pops will degrade performance?
It depends on your ultimate goal. More recent OpenGL specs enable features for "geometry instancing". You can load all the matrices into a buffer and then draw all 50k with a single "draw instances" call (OpenGL 3+). If you are looking for a temporary fix, at the very least, load the polygon into a Vertex Buffer Object. Display Lists are very old and deprecated.
Are these 50k polygons going to move independently? You'll have to put up with some form of "pushing/popping" (even though modern scene graphs do not necessarily use an explicit matrix stack). If the 50k polygons are static, you could pre-compile the entire scene into one VBO. That would make it render very fast.
If you can assume a recent version of OpenGL (>=3.1, IIRC) you might want to look at glDrawArraysInstanced and/or glDrawElementsInstanced. For older versions, you can probably use glDrawArraysInstancedEXT/`glDrawElementsInstancedEXT, but they're extensions, so you'll have to access them as such.
Either way, the general idea is fairly simple: you have one mesh, and multiple transforms specifying where to draw the mesh, then you step through and draw the mesh with the different transforms. Note, however, that this doesn't necessarily give a major improvement -- it depends on the implementation (even more than most things do).

Resources