culling instanced meshes without geometry shader - opengl-es

What's an effective way to cull instanced meshes
(e.g. 2000 trees, each with ~17 triangles) without using the geometry shader?
Unfortunately my software supports only OpenGL ES 3.0, so I have to cull in the vertex shader or somewhere else.
Another solution would be to rearrange the instance buffer in each frame.

GPU culling is pointless if it cannot be done efficiently; that is, after all, the whole point of putting culling on the GPU to begin with.
Efficient GPU culling requires the following:
A way to conditionally write data to GPU memory from a shader, in a controllable format.
A way to have rendering commands executed based on data stored entirely within GPU memory, without CPU/GPU synchronization.
OpenGL ES 3.0 lacks a mechanism for doing either of these. Geometry shaders and transform feedback are the older means of doing #1, but it could also be done with compute shaders and SSBOs/image load/store. Of course, ES 3.0 has neither set of functionality; you'd need ES 3.1 for that.
ES 3.0 also has no indirect rendering features, which could be used to actually render with the GPU-generated data without reading it back to the CPU. So even if you had a way to do #1, you'd have to read the data back on the CPU to be able to use it in a rendering command.
So unless CPU culling is somehow more expensive than doing a full GPU/CPU sync (it almost certainly isn't), it's best to just do the culling on the CPU.
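If you go the CPU route, here is a minimal sketch of the rebuild-the-instance-buffer-each-frame approach, assuming each tree instance is a small struct with a position and scale, and that sphereInFrustum, frustumPlanes and treeRadius are your own helpers (none of this is from the question):

struct TreeInstance { float x, y, z, scale; };   // illustrative per-instance layout

// Keep only the instances whose bounding sphere passes a CPU frustum test,
// then upload the surviving subset and draw it with one instanced call.
std::vector<TreeInstance> visible;
visible.reserve(allTrees.size());
for (const TreeInstance &t : allTrees)
    if (sphereInFrustum(frustumPlanes, t.x, t.y, t.z, treeRadius * t.scale))
        visible.push_back(t);

glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
// Orphan the old storage, then upload only the visible instances.
glBufferData(GL_ARRAY_BUFFER, allTrees.size() * sizeof(TreeInstance), nullptr, GL_STREAM_DRAW);
glBufferSubData(GL_ARRAY_BUFFER, 0, visible.size() * sizeof(TreeInstance), visible.data());

glBindVertexArray(treeVAO);   // instance attribute set up with glVertexAttribDivisor(attrib, 1)
glDrawElementsInstanced(GL_TRIANGLES, treeIndexCount, GL_UNSIGNED_SHORT, 0,
                        (GLsizei)visible.size());

With ~2000 instances the frustum test is cheap, and the single instanced draw with only the visible subset avoids any GPU/CPU sync.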

Related

How do I see the GPU's bottleneck in a complex algorithm?

I'm using GLSL fragment shaders for GPGPU calculations (I have my reasons).
In Nsight I see that I'm doing 1600 drawcalls per frame.
There could be 3 bottlenecks:
Fillrate
Just too many drawcalls
GPU stalls due to my GPU->CPU downloads and CPU->GPU uploads
How do I find which one it is?
If my algorithm were simple (e.g. a Gaussian blur or something), I could force the viewport of each drawcall to be 1x1 and, depending on the speed change, rule out a fillrate problem.
In my case, though, that would require changing the entire algorithm.
Since you mention the Nvidia Nsight tool, you could try following the procedure explained in the following Nvidia blog post.
It explains how to read and understand hardware performance counters to interpret performance bottlenecks.
The Peak-Performance-Percentage Analysis Method for Optimizing Any GPU Workload: https://devblogs.nvidia.com/the-peak-performance-analysis-method-for-optimizing-any-gpu-workload/
Instead of pinning down which one it is, consider changing how you do the calculation.
I'm using GLSL fragment shaders for GPGPU calculations (I have my reasons).
I'm not sure what your OpenGL version is, but using a compute shader instead of a fragment shader would solve this.
In Nsight I see that I'm doing 1600 drawcalls per frame.
Do you mean actual OpenGL draw calls? That is almost certainly one of the reasons. With fragment-shader GPGPU you have to draw into FBOs to run your calculations on the GPU; a compute shader dispatches its work without issuing draw calls at all. That is the big difference between compute shaders and fragment shaders: draw calls always slow the program down, and compute shaders don't need them.
An architectural advantage of compute shaders for image processing is that they skip the ROP (render output unit) step. It's very likely that writes from pixel shaders go through all the regular blending hardware even if you don't use it.
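As a rough illustration of the compute route (assuming desktop GL 4.3+ or ES 3.1+; the image size and shader body are placeholders, not your algorithm), note that there are no draw calls and no ROP/blend stage involved:

const char *cs =
    "#version 310 es\n"
    "layout(local_size_x = 16, local_size_y = 16) in;\n"
    "layout(rgba32f, binding = 0) writeonly uniform highp image2D dst;\n"
    "void main() {\n"
    "    ivec2 p = ivec2(gl_GlobalInvocationID.xy);\n"
    "    imageStore(dst, p, vec4(0.0));   // placeholder computation\n"
    "}\n";

// Host side: bind the output texture as an image and dispatch one workgroup per 16x16 tile.
glUseProgram(computeProgram);   // program compiled and linked from 'cs'
glBindImageTexture(0, dstTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
glDispatchCompute((width + 15) / 16, (height + 15) / 16, 1);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);   // make the writes visible before sampling dstTex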
If you have to use fragment shaders somehow, then:
try to reduce the number of draw calls;
find a way to keep the data being calculated on the GPU.
The latter amounts to using render textures as memory: if you need to change vertices using RTTs, you load textures holding position, velocity, or whatever else you need to change the vertices or their attributes (normal/color, etc.).
To find the actual reason, use CPU and GPU profilers appropriate to your chipset and OS.

How can I perform arbitrary computation in OpenGL ES 3.0 (prior to compute shaders)

Prior to the introduction of compute shaders in OpenGL ES 3.1, what techniques or tricks can be used to perform general computations on the GPU? e.g. I am animating a particle system and I'd like to farm out some work to the GPU. Can I make use of vertex shaders with "fake" vertex data somehow?
EDIT:
I found this example which looks helpful: http://ciechanowski.me/blog/2014/01/05/exploring_gpgpu_on_ios/
You can use vertex shaders and transform feedback to output the results to an application accessible buffer. The main downside is that you can't have cross-thread data sharing between "work items" like you can with a compute shader, so they are not 100% equivalent.
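A minimal sketch of that approach on ES 3.0, assuming a vertex shader that writes its result to an output variable named outPosition (all names and sizes here are placeholders):

// Before linking, declare which vertex shader output gets captured.
const char *varyings[] = { "outPosition" };
glTransformFeedbackVaryings(program, 1, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);

// Run the "compute" pass: rasterization is discarded, only the captured data matters.
glUseProgram(program);
glEnable(GL_RASTERIZER_DISCARD);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, resultVBO);
glBeginTransformFeedback(GL_POINTS);
glDrawArrays(GL_POINTS, 0, particleCount);   // one "work item" per vertex
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);

// Either read the results back on the CPU...
glBindBuffer(GL_ARRAY_BUFFER, resultVBO);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, particleCount * 4 * sizeof(float), GL_MAP_READ_BIT);
// ...or ping-pong resultVBO back in as next frame's input attribute buffer, avoiding the read-back.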

OpenGL (ES): Can an implementation optimize fragments resulting from overdraw?

I wanted to come up with a crude way to "benchmark" the performance improvement of a tweak I made to a fragment shader (to be specific, I wanted to test the performance impact of removing the pow-based gamma computation for the resulting color in the fragment shader).
So I figured that if a frame takes 1 ms to render an opaque cube model using my shader, then with glDisable(GL_DEPTH_TEST) set and my render call looped 100 times, the frame should take 100 ms to render.
I was wrong. Rendering it 100 times only results in about a 10x slowdown. Obviously, if the depth test were still enabled, most if not all of the fragments in the second and subsequent draw calls would not be computed, because they would all fail the depth test.
However, I must still be getting a lot of fragment culling even with the depth test off.
My question is about whether my hardware (in this particular situation it is an iPad3 on iOS6.1 that I am experiencing this on -- a PowerVR SGX543MP4) is just being incredibly smart and is actually able to use the geometry of later draw calls to occlude and discard fragments from the earlier geometry. If this is not what's happening, then I cannot explain the better-than-expected performance that I am seeing. The question applies to all flavors of OpenGL and desktop GPUs as well, though.
Edit: I think an easy way to "get around" this optimization might be glEnable(GL_BLEND) or something of that sort. I will try this and report back.
PowerVR hardware is based on tile-based deferred rendering. It does not begin drawing fragments until after it receives all of the geometry information for a tile on screen. This is a more advanced hidden-surface removal technique than z-buffering, and what you have actually discovered here is that enabling alpha blending breaks the hardware's ability to exploit this.
Alpha blending is very order-dependent, and so no longer can rasterization and shading be deferred to the point where only the top-most geometry in a tile has to be drawn. Without alpha blending, since there is no data dependency on the order things are drawn in, completely obscured geometry can be skipped before expensive per-fragment operations occur. It is only when you start blending fragments that a true order-dependent situation arises and completely destroys the hardware's ability to defer/cull fragment processing for hidden surfaces.
In all honesty, if you are trying to optimize for a platform based on PowerVR hardware you should probably make this one of your goals. By that, I mean, before optimizing shaders first consider whether you are drawing things in an order and/or with states that hurt the PowerVR hardware's ability to do TBDR. As you have just discovered, blending is considerably more expensive on PowerVR hardware than other hardware... the operation itself is no more complicated, it just prevents PVR hardware from working the special way it was designed to.
I can confirm that only after adding both lines:
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
did the frame render time increase in a linear fashion in response to the repeated draw calls. Now back to my crude benchmarking.
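For reference, a sketch of the kind of crude measurement described above, assuming a wall-clock helper now() and a drawCubeModel() routine (both hypothetical); glFinish is what forces the GPU to actually complete the work being timed:

glDisable(GL_DEPTH_TEST);
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);   // order-dependent blending defeats the TBDR hidden-surface removal

glFinish();                        // drain previously queued work so it is not measured
double t0 = now();                 // hypothetical wall-clock timer, in seconds
for (int i = 0; i < 100; ++i)
    drawCubeModel();               // hypothetical draw of the opaque test model
glFinish();                        // wait until the GPU has really finished
double msPerDraw = (now() - t0) * 1000.0 / 100.0;

A GPU timer query (e.g. the EXT_disjoint_timer_query extension) would be more accurate where available, since glFinish also measures CPU submission and sync overhead.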

OpenGL Optimization - Duplicate Vertex Stream or Call glDrawElements Repeatedly?

This is for an OpenGL ES 2.0 game on Android, though I suspect the right answer is generic to any opengl situation.
TL;DR - is it better to send N data to the GPU once and then make K draw calls with it, or send K*N data to the GPU once and make 1 draw call?
More details: I'm wondering about best practices for my situation. I have a dynamic mesh whose vertices I recompute every frame - think of it as a water surface - and I need to project these vertices onto K different quads in my game. (In each case the projection is slightly different; sparing details, you could imagine them as K different mirrors surrounding the mesh.) K is on the order of 10-25; I'm still figuring it out.
I can think of two broad options:
1. Bind the mesh as is, and call draw K different times, either changing a uniform for shaders or messing with the fixed-function state, to render to the correct quad in place (on the screen) or to different segments of a texture (which I can later use when rendering the quads to achieve the same effect).
2. Duplicate all the vertices in the mesh K times, essentially making a single vertex stream with K meshes in it, add an attribute (or a few) indicating which quad each mesh clone is supposed to project onto (and how to get there), and use vertex shaders to project. I would make one call to draw, but send K times as much data.
The Question: of those two options, which is generally better performance wise?
(Additionally: is there a better way to do this?
I had considered a third option, where I rendered the mesh details to a texture, and created my K-clone geometry as a sort of dummy stream, which I could bind once and for all, that looked up in a vertex shader into the texture for each vertex to find out what vertex it really represented; but I've been told that texture support in vertex shaders is poor or prohibited in OpenGL ES 2.0 and would prefer to avoid that route.)
There is no perfect answer to this question, though I would suggest you think about the nature of real-time computer graphics and the OpenGL pipeline. Although "the GL" is required to produce results that are consistent with in-order execution, the reality is that GPUs are highly parallel beasts. They employ lots of tricks that work best if you actually have many unrelated tasks going on at the same time (some even split the whole pipeline up into discrete tiles). GDDR memory, for instance is really high latency, so for efficiency GPUs need to be able to schedule other jobs to keep the stream processors (shader units) busy while memory is fetched for a job that is just starting.
If you are recomputing parts of your mesh each frame, then you will almost certainly want to favor more draw calls over massive CPU->GPU data transfers every frame. Saturating the bus with unnecessary data transfers plagues even PCI Express hardware (it is far slower than the overhead that several additional draw calls would ever add), and it can only get worse on embedded OpenGL ES systems. Having said that, there is no reason you could not simply do glBufferSubData (...) to stream in only the affected portions of your mesh and continue to draw the entire mesh in a single draw call.
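For instance, something along these lines, where only the recomputed vertex range is re-uploaded each frame (the names and the Vertex struct are illustrative):

// Re-upload only the vertices that were recomputed this frame.
glBindBuffer(GL_ARRAY_BUFFER, meshVBO);
glBufferSubData(GL_ARRAY_BUFFER,
                firstDirtyVertex * sizeof(Vertex),    // byte offset of the dirty range
                dirtyVertexCount * sizeof(Vertex),    // byte size of the dirty range
                cpuVertices + firstDirtyVertex);      // CPU copy of the mesh data
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);   // still a single draw call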
You might get better cache coherency if you split (or partition the data within) the buffer and/or draw calls, depending on your actual use case. The only way to tell decisively which is going to work better in your case is to profile your software on your target hardware. But all of this fails to look at the bigger picture, which is: "Why am I doing this on the CPU?!"
It sounds like what you really want is simply vertex instancing. If you can re-work your algorithm to work completely in vertex shaders by passing instance IDs you should see a massive improvement over all of the solutions I have seen you propose so far (true instancing is actually somewhere between what you described in solutions 1 and 2) :)
The actual concept of instancing is very simple and will give you benefits whether your particular version of the OpenGL API supports it at the API level or not (you can always implement it manually with vertex attributes and extra vertex buffer data). The thing is, you would not have to duplicate your data at all if you implement instancing correctly. The extra data necessary to identify each individual vertex is static, and you can always change a shader uniform and make an additional draw call (this is probably what you will have to do with OpenGL ES 2.0, since it does not offer glDrawElementsInstanced) without touching any vertex data.
You certainly will not have to duplicate your vertices K*N times, your buffer space complexity would be more like O (K + K*M), where M is the number of new components you had to add to uniquely identify each vertex so that you could calculate everything on the GPU. For "instance," you might need to number each of the vertices in your quad 1-4 and process the vertex differently in your shader depending on which vertex you're processing. In this case, the M coefficient is 1 and it does not change no matter how many instances of your quad you need to dynamically calculate each frame; N would determine the number of draw calls in OpenGL ES 2.0, not the size of your data. None of this additional storage space would be necessary in OpenGL ES 2.0 if it supported gl_VertexID :(
Instancing is the best way to make effective use of the highly-parallel GPU and avoid CPU/GPU synchronization and slow bus transfers. Even though OpenGL ES 2.0 does not support instancing in the API sense, multiple draw calls using the same vertex buffer where the only thing you change between calls are a couple of shader uniforms is often preferable to computing your vertices on the CPU and uploading new vertex data every frame or having your vertex buffer's size depend directly on the number of instances you intend to draw (yuck). You'll have to try it out and see what your hardware likes.
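A sketch of what "instancing" without API support can look like on ES 2.0: one shared vertex buffer, K draw calls, and only a uniform changing between calls (the uniform name and the quadProjection array are placeholders for whatever per-quad data you need):

glUseProgram(program);
glBindBuffer(GL_ARRAY_BUFFER, meshVBO);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, meshIBO);
// ... glVertexAttribPointer / glEnableVertexAttribArray setup for the shared mesh ...

GLint locProj = glGetUniformLocation(program, "u_instanceProjection");
for (int k = 0; k < K; ++k) {
    glUniformMatrix4fv(locProj, 1, GL_FALSE, quadProjection[k]);   // per-"instance" data
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);
}
// On ES 3.0+ the loop collapses into one glDrawElementsInstanced call, with the
// per-quad data fetched in the vertex shader via gl_InstanceID or a divisor attribute.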
Instancing would be what you are looking for, but unfortunately it is not available in OpenGL ES 2.0. I would be in favor of sending all the vertices to the GPU and making one draw call, if all your assets can fit into GPU memory. In my experience, reducing draw calls from 100+ to 1 took performance from 15 fps to 60 fps.

Is the distinction between vertex and pixel shader necessary or even beneficial?

From what I've been able to gather, both vertex and pixel shader operations boil down to passing data in and doing much the same work with it on every available unit. Sure, vertex and pixel shaders sit in different parts of the classical graphics pipeline, but wouldn't it be better to have more abstraction and just be able to run arbitrary parallel operations on arbitrary data, in an arbitrary order? I guess such an abstraction could also be used to emulate OpenCL, compute shaders, and whatever other general or specialized compute API.
Specialization helps drivers perform optimally and simplifies application code. Pixel shaders run after rasterization, and it's great not to have to worry about rasterizing. You could use CUDA or OpenCL to do anything you like in a completely graphics-agnostic way.
Alt. Yes. It's coming.
Historically the distinction was necessary because vertex and pixel shaders were physically implemented in different hardware units with different capabilities. Nowadays pretty much all PC GPUs and even many mobile GPUs have unified shader architectures where vertex and pixel shaders execute on the same hardware compute units so the distinction is less necessary. It is still useful in the context of a graphics pipeline however since the inputs and outputs and their meanings are determined by the logical position of the vertex and pixel shaders in the rendering pipeline.
For GPGPU type problems where the traditional graphics pipeline is not meaningful / relevant you have compute shaders that expose the full capabilities of the underlying hardware outside of the traditional vertex / pixel shader model. Compute shaders in a sense are the abstraction you're talking about so it doesn't really make sense to talk about 'emulating' them with such an abstraction. Compute shaders are just a way to expose the physical hardware compute units used by vertex and pixel shaders for use outside of the traditional graphics pipeline.
