Transform Feedback: batch several feedbacks together - opengl-es

Target: OpenGL ES >= 3.0.
Here's what my app does:
generateSeveralMeshes();
setupStuff();
for (each Mesh)
{
    glBindBufferBase(GLES30.GL_TRANSFORM_FEEDBACK_BUFFER, 0, myBuf);
    glBeginTransformFeedback(GLES30.GL_POINTS);
    callOpenGLToGetTransformFeedback();
    glMapBufferRange(GLES30.GL_TRANSFORM_FEEDBACK_BUFFER, ...);  // THE PROBLEM
    computeStuffDependantOnVertexAttribsGottenBack();
    glUnmapBuffer(GLES30.GL_TRANSFORM_FEEDBACK_BUFFER);
    glEndTransformFeedback();
    glBindBufferBase(GLES30.GL_TRANSFORM_FEEDBACK_BUFFER, 0, 0);
    renderTheMeshAsNormal();
}
i.e. for each Mesh, it first uses the vertex shader to compute some per-vertex data, reads that data back to the CPU, makes some decisions based on it, and only then renders the Mesh.
This works; the problem is speed. We've been testing on several OpenGL ES 3.0, 3.1 and 3.2-based devices, and on each one the story looks the same: the glMapBufferRange() call cuts the FPS roughly in half!
I suspect that without glMapBufferRange(), OpenGL can render 'lazily', i.e. batch up several renders and do them at its own convenience, whereas if we call glMapBufferRange() it really has to render now, which is probably what makes it slow (the amount of data we get back is quite small, so I really don't think that is the problem).
Thus, I'd like to batch up my Transform Feedback as well, like this:
generateSeveralMeshes();
setupStuff();
for (each Mesh)
{
    glBindBufferBase(GLES30.GL_TRANSFORM_FEEDBACK_BUFFER, 0, myLargerBuf);
    glBeginTransformFeedback(GLES30.GL_POINTS);
    setupOpenGLtoSaveTransformFeedbackToSpecificOffset();
    callOpenGLToGetTransformFeedback();
    advanceOffset();
    glEndTransformFeedback();
    glBindBufferBase(GLES30.GL_TRANSFORM_FEEDBACK_BUFFER, 0, 0);
    renderTheMeshAsNormal();
}
glMapBufferRange(GLES30.GL_TRANSFORM_FEEDBACK_BUFFER, ...);
computeStuffDependantOnVertexAttribsGottenBackInOneBatch();
glUnmapBuffer(GLES30.GL_TRANSFORM_FEEDBACK_BUFFER);
The problem is that I don't know how to tell OpenGL to write the Transform Feedback output not to the beginning of the TRANSFORM_FEEDBACK_BUFFER, but to a specific offset within it (so that after the loop I can get my hands on all the TF data that came back, in one go).
Any advice?

The performance issue is pipelining: you're basically forcing the GPU into lockstep with the CPU, because glMapBufferRange() has to block until the result is available. This is "very bad". All GPUs (especially the tile-based GPUs found in mobile) rely on the driver building up a queue of work which runs asynchronously to the application, maintaining forward pressure that keeps the hardware busy. Anything the application does to force synchronization and drain the pipeline will kill performance.
Good blog on it here:
https://community.arm.com/graphics/b/blog/posts/the-mali-gpu-an-abstract-machine-part-1---frame-pipelining
In general, if you are reading back on the CPU, only read the data back one or two frames after you queued the draw calls that generated it. (Consuming results on the GPU doesn't have this problem; that will pipeline.)
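A minimal sketch of one way to get that one-frame delay, assuming two transform feedback buffers used in rotation plus a fence sync object per frame; the TFSlot struct and function names are illustrative, not from the original code:
#include <GLES3/gl3.h>

struct TFSlot
{
    GLuint buffer = 0;     // transform feedback buffer written in this slot's frame
    GLsync fence  = nullptr;
};

void endOfFrameReadback(TFSlot slots[2], int frame, GLsizeiptr bytes)
{
    TFSlot& current  = slots[frame & 1];        // captured into this frame
    TFSlot& previous = slots[(frame + 1) & 1];  // captured last frame

    // Fence placed after this frame's transform feedback draws were issued.
    current.fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

    if (previous.fence != nullptr)  // nothing to read on the very first frame
    {
        // Usually already signalled by now, so this rarely stalls.
        glClientWaitSync(previous.fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                         1000000000ull /* cap the wait at ~1 s */);
        glDeleteSync(previous.fence);
        previous.fence = nullptr;

        // previous.buffer is no longer bound for active transform feedback,
        // so mapping it here does not drain this frame's work.
        glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, previous.buffer);
        void* data = glMapBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0, bytes,
                                      GL_MAP_READ_BIT);
        // ... consume last frame's per-vertex results from 'data' ...
        glUnmapBuffer(GL_TRANSFORM_FEEDBACK_BUFFER);
    }
}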
To bind at specific offsets into a transform feedback buffer, use glBindBufferRange(), as per the comment.
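A minimal sketch of the batched layout, using C-style GLES 3.0 calls (the Android GLES30 Java bindings mirror these one-to-one); the per-mesh size and the drawMeshForCapture()/renderMeshAsNormal() helpers are illustrative:
#include <GLES3/gl3.h>

void drawMeshForCapture(int mesh);   // hypothetical: draw issued while TF is active
void renderMeshAsNormal(int mesh);   // hypothetical: the normal on-screen draw

void captureAllMeshes(GLuint tfBuffer, GLsizeiptr perMeshBytes, int meshCount)
{
    GLintptr offset = 0;
    for (int i = 0; i < meshCount; ++i)
    {
        // Bind only the slice [offset, offset + perMeshBytes) to TF binding point 0.
        // Both offset and size must be multiples of 4 for this target.
        glBindBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0,
                          tfBuffer, offset, perMeshBytes);

        glBeginTransformFeedback(GL_POINTS);
        drawMeshForCapture(i);
        glEndTransformFeedback();

        renderMeshAsNormal(i);
        offset += perMeshBytes;
    }

    // One map after the loop covers every mesh's captured vertices.
    void* data = glMapBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0, offset,
                                  GL_MAP_READ_BIT);
    // ... computeStuffDependantOnVertexAttribsGottenBackInOneBatch(data) ...
    glUnmapBuffer(GL_TRANSFORM_FEEDBACK_BUFFER);
}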

Related

How to optimize a call to webGL.bufferData with a large amount of static vertex data (500k vertices)

Hi all, I am making a Minecraft-styled game, hence the large number of vertices in a single bufferData call. I have already optimised things so that each vertex has only a single float32 attribute, which basically constitutes a vertex ID, so 500k vertices use only 500k float32 attributes in total.
The attributes in the vertex buffer are the equivalent of a series of ordered numbers that will never change. It is only the number of them that I send through that varies.
The actual data for the Minecraft-style cubes is converted to a texture and passed into the shader that way. The shader does all the work of decoding the texture data and mapping it to each vertex.
I am currently getting 60 fps in Chrome running all of this, but I want to push it up to at least 2 million vertices, so I need to somehow improve the performance.
When performance benchmarking in Chrome, it appears that the CPU, not the GPU, is the main bottleneck. The call to bufferData consumes 33% of CPU resources and the call to texImage2d consumes 9.5% of CPU resources.
I am looking for any ideas on how to improve this performance.
So I found the solution. You can use bufferSubData to update the vertex data in an existing vertex buffer and just store, rebind and reuse that. However, in my case the data in the vertex buffer itself never changes, so all I needed to do was store the created vertex buffer and rebind it whenever I want to use it, and absolutely no calls to any kind of buffer method are needed, thereby eliminating the 33% of CPU used for bufferData completely! Must still do further testing, but it's looking good so far.
Using this solution I was able to go as high as 14 million vertices, and that's only because of other limitations, not the call to texImage2d, which is still only sitting at around 30% now.
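For reference, a minimal sketch of that "upload once, rebind each frame" pattern, written here with C-style GL calls (the WebGL functions the question uses, gl.bufferData, gl.bindBuffer and friends, are the JavaScript equivalents); the attribute layout is illustrative:
#include <GLES3/gl3.h>

GLuint createStaticVertexIdBuffer(const float* ids, int count)
{
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Done exactly once: GL_STATIC_DRAW hints that the data never changes.
    glBufferData(GL_ARRAY_BUFFER, count * sizeof(float), ids, GL_STATIC_DRAW);
    return vbo;
}

void drawEachFrame(GLuint vbo, GLint attribLocation, int count)
{
    // Per frame: only a cheap rebind, no bufferData / bufferSubData at all.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexAttribPointer(attribLocation, 1, GL_FLOAT, GL_FALSE, 0, nullptr);
    glEnableVertexAttribArray(attribLocation);
    glDrawArrays(GL_TRIANGLES, 0, count);
}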

WebGL: What is faster?

What is faster in WebGL?
once:
create 1000 shaders for 1000 objects and set their uniforms
every frame:
bind shaders when rendering them
Or
once:
create 10 shaders for 1000 objects
every frame:
bind shaders + update uniforms according to objects?
I know I could write a test for it, but I feel that someone has surely thought about this before me. Thank you very much.
It's helpful to remember that the Graphics Pipeline is an actual pipeline typically implemented in hardware. You get to configure the pipeline by assigning shaders and setting uniforms, and then you get to activate the pipeline (by calling drawElements or one of its friends). This essentially loads a pile of input data into the start of the pipeline, and kicks off a process that is highly parallel. For example, in the middle of a run, some early vertices will have made it through the vertex shader and rasterizer, and the resulting fragments are being shaded, while other vertices are still back at the vertex shader stage being transformed. The different sections of the pipeline are all doing their thing to the data flowing by.
After you kick off this process, the CPU is free to do other stuff while the pipeline runs. But, if you want to reconfigure the pipeline, such as by changing shaders or altering uniforms, the CPU will block your thread and wait for the pipeline to be completely done to the last pixel.
This means you want to avoid stopping and restarting the pipeline, to the extent possible. So the usual strategy is batching: Get as much work done as possible in a single draw call, with a single set of uniforms. That way, you exploit the parallel nature of the pipeline to the best extent possible in your app.
Changing shaders is expensive (it invalidates the instruction cache), updating uniforms cheap (it just updates values in a register file).
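A minimal sketch of the second approach along those lines, written with C-style GL calls (the WebGL equivalents have the same names on the gl object); objects are assumed pre-sorted by program so the expensive program switch happens as rarely as possible, and the Object struct and its fields are illustrative:
#include <GLES3/gl3.h>
#include <vector>

struct Object
{
    GLuint  program;      // one of the ~10 shared programs
    GLint   modelLoc;     // uniform location queried once at setup
    GLfloat model[16];    // per-object data pushed every frame
    GLuint  vao;
    GLsizei indexCount;
};

void renderAll(const std::vector<Object>& objectsSortedByProgram)
{
    GLuint boundProgram = 0;
    for (const Object& obj : objectsSortedByProgram)
    {
        if (obj.program != boundProgram)   // expensive: only on a program change
        {
            glUseProgram(obj.program);
            boundProgram = obj.program;
        }
        // Cheap: per-object uniform update, then draw.
        glUniformMatrix4fv(obj.modelLoc, 1, GL_FALSE, obj.model);
        glBindVertexArray(obj.vao);
        glDrawElements(GL_TRIANGLES, obj.indexCount, GL_UNSIGNED_SHORT, nullptr);
    }
}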

Rendering in DirectX 11

When a frame starts, I do my logical update and render after that.
In my render code I do the usual stuff. I set a few states, buffers, and textures, and end by calling Draw.
m_deviceContext->Draw(nbVertices, 0);
At the end of the frame I call Present to show the rendered frame.
// Present the back buffer to the screen since rendering is complete.
if (m_vsync_enabled)
{
    // Lock to screen refresh rate.
    m_swapChain->Present(1, 0);
}
else
{
    // Present as fast as possible.
    m_swapChain->Present(0, 0);
}
Usual stuff. Now, when I call Draw, according to MSDN
Draw submits work to the rendering pipeline.
Does that mean the data is sent to the GPU and the main thread (the one that called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only the Present function should make the main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one. Others include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that stuff like PSSetShader, IASetPrimitiveTopology, etc. doesn't really do anything until you call Draw.
When you call Present, that is taken as an implicit indicator of 'end of frame', but your program can often continue setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch up; in real-time rendering you usually don't want the latency between input and render to be very high.
The fact is, however, that GPU/CPU synchronization is complicated, and the Direct3D runtime is also batching up requests to minimize kernel-call overhead, so the actual work could be happening after many Draws have been submitted to the command queue. This old article gives you a flavor of how this works. On modern GPUs, you can also have various memory operations for paging in memory, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12, but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer is much more directly building up command queues, triggering work on the various pixel and compute GPU engines, and doing all the messy stuff that is handled a little more abstractly and automatically by Direct3D 11's runtime. Even then, ultimately the video driver is the one actually talking to the hardware, so it can do other kinds of optimizations as well.
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by the HLSL compiler) and runtime shader blob optimization (by the driver).
Copying resources to the GPU (i.e. loading texture data from CPU memory) requires bus bandwidth that is limited in supply: prefer to keep texture, VB, and IB data in static buffers you reuse, as sketched after this list.
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a back-channel that is slower than going to the GPU: try to avoid the need for readback from the GPU.
Submitting larger chunks of geometry per Draw call helps to amortize overhead (i.e. calling Draw once for 10,000 triangles with the same state/shader is much faster than calling Draw ten times for 1,000 triangles each, changing state/shaders in between).
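A small D3D11 sketch of the "static buffer you reuse, one big Draw" rules above; the Vertex layout, counts, and helper names are illustrative, and error handling plus input-layout setup are omitted:
#include <d3d11.h>

struct Vertex { float position[3]; float uv[2]; };

ID3D11Buffer* CreateStaticVertexBuffer(ID3D11Device* device,
                                       const Vertex* vertices, UINT count)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = static_cast<UINT>(count * sizeof(Vertex));
    desc.Usage = D3D11_USAGE_IMMUTABLE;        // uploaded once, never mapped again
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

    D3D11_SUBRESOURCE_DATA initData = {};
    initData.pSysMem = vertices;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, &initData, &buffer);
    return buffer;
}

void DrawEveryFrame(ID3D11DeviceContext* context, ID3D11Buffer* buffer, UINT count)
{
    // Per frame: bind the already-resident buffer and issue one large Draw,
    // instead of re-uploading vertex data or issuing many small draws.
    UINT stride = sizeof(Vertex), offset = 0;
    context->IASetVertexBuffers(0, 1, &buffer, &stride, &offset);
    context->Draw(count, 0);
}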

How to temporarily disable OpenGL command queueing, for more accurate profiling results?

In Mac OS X's OpenGL Profiler app, I can get statistics regarding how long each GL function call takes. However, the results show that a ton of time is spent in flush commands (glFlush, glFlushRenderAPPLE, CGLFlushDrawable) and in glDrawElements, and every other GL function call's time is negligibly small.
I assume this is because OpenGL is enqueueing the commands I submit, and waiting until flushing or drawing to actually execute the commands.
I guess I could do something like this:
glFlush();
startTiming();
glDoSomething();
glFlush();
stopTimingAndRecordDelta();
...and insert that pattern around every GL function call my app makes, but that would be tedious since there are thousands of GL function calls throughout my app, and I'd have to tabulate the results manually (instead of using the already-existent OpenGL Profiler tool).
So, is there a way to disable all OpenGL command queueing, so I can get more accurate profiling results?
So, is there a way to disable all OpenGL command queueing, ...
No, there isn't an OpenGL function that does that.
..., so I can get more accurate profiling results?
You can get more accurate information than you are currently getting, but you'll never get really precise answers (though you can probably get what you need). While the results of OpenGL rendering are the "same" (OpenGL isn't guaranteed to be pixel-accurate across implementations, but it's supposed to be very close), how the pixels are generated can vary drastically. In particular, tiled renderers (common in mobile and embedded devices) usually don't render pixels during a draw call, but rather queue up the geometry and generate the pixels at buffer swap.
That said, for profiling OpenGL you want to use glFinish instead of glFlush. glFinish forces all pending OpenGL calls to complete before returning; glFlush merely requests that queued commands be sent to the OpenGL implementation "at some time in the future", so it's not deterministic. Be sure to remove the glFinish calls from your "production" code, since they will really slow down your application. If you replace the flushes with finishes in your example, you'll get more interesting information.
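In other words, the bracketing pattern from the question with the flushes swapped for finishes; glDoSomething() is the question's placeholder for the call being measured, and recordDelta() is a hypothetical bookkeeping helper:
#include <OpenGL/gl3.h>   // macOS core-profile header
#include <chrono>

void glDoSomething();                                     // placeholder from the question
void recordDelta(std::chrono::steady_clock::duration d);  // hypothetical bookkeeping

void timeOneCall()
{
    glFinish();                                  // drain everything queued so far
    auto start = std::chrono::steady_clock::now();

    glDoSomething();                             // the call under test

    glFinish();                                  // wait until it has fully executed
    recordDelta(std::chrono::steady_clock::now() - start);
}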
You are using OpenGL 3, and in particular discussing OS X. Mavericks (10.9) supports Timer Queries, which you can use to time a single GL operation or an entire sequence of operations at the pipeline level. That is, how long they take to execute when GL actually gets around to performing them, rather than timing how long a particular API call takes to return (which is often meaningless). You can only have a single timer query in the pipeline at a given time unfortunately, so you may have to structure your software cleverly to make best use of them if you want command-level granularity.
I use them in my own work to time individual stages of the graphics engine. Things like how long it takes to update shadow maps, build the G-Buffers, perform deferred / forward lighting, individual HDR post-processing effects, etc. It really helps identify bottlenecks if you structure the timer queries this way instead of focusing on individual commands.
For instance, on some fillrate-limited hardware shadow map generation is the biggest bottleneck; on other, shader-limited hardware, lighting is. You can even use the results to determine the optimal shadow map resolution or lighting quality to meet a target framerate for a particular host, without requiring the user to set these parameters manually. If you simply timed how long the individual operations took you would never get the bigger picture, but if you time entire sequences of commands that actually do some major part of your rendering, you get neatly packed information that can be a lot more useful than even the output from profilers.
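For illustration, a minimal timer-query sketch along those lines, assuming a GL 3.3+ core context (OS X Mavericks exposes 4.1 core); renderShadowMaps() stands in for whatever stage is being timed:
#include <OpenGL/gl3.h>   // macOS core-profile header; timer queries need GL 3.3+ (ARB_timer_query)

void renderShadowMaps();                 // hypothetical: the stage being measured

GLuint gShadowQuery = 0;

void renderShadowPassTimed()
{
    if (gShadowQuery == 0)
        glGenQueries(1, &gShadowQuery);

    // Only one GL_TIME_ELAPSED query can be active at a time, so each stage
    // gets its own begin/end bracket.
    glBeginQuery(GL_TIME_ELAPSED, gShadowQuery);
    renderShadowMaps();
    glEndQuery(GL_TIME_ELAPSED);
}

// Read the result a frame or two later, once the query has completed, so this
// call doesn't stall the pipeline.
double shadowPassMilliseconds()
{
    GLint available = 0;
    glGetQueryObjectiv(gShadowQuery, GL_QUERY_RESULT_AVAILABLE, &available);
    if (!available)
        return -1.0;                     // result not ready yet

    GLuint64 nanoseconds = 0;
    glGetQueryObjectui64v(gShadowQuery, GL_QUERY_RESULT, &nanoseconds);
    return nanoseconds / 1.0e6;          // GPU time spent in the pass, in milliseconds
}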

why are draw calls expensive?

Assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card: a few bytes to identify the data, presumably a 4x4 matrix, and some assorted other parameters.
So where is all of the overhead coming from? Do the operations require some sort of handshake with the GPU?
Why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex IDs and transformation matrices? (The second option looks like it should send less data, unless the models are smaller than a 4x4 matrix.)
First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs to making draw calls: they require setting up a bunch of state (which set of vertices to use, what shader to use, and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only apply if each call submits too little data, since this will cause you to be CPU-bound, and stop you from utilizing the hardware fully.
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!, it's fairly old but covers exactly this topic.
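To make the contrast concrete, a small sketch of the two submission patterns under discussion (core-profile GL style calls; the VAO handles and counts are illustrative):
#include <GLES3/gl3.h>

// CPU-bound: one draw call per small model, so per-call driver overhead dominates.
void drawManySmallModels(const GLuint* vaos, const GLsizei* counts, int n)
{
    for (int i = 0; i < n; ++i)
    {
        glBindVertexArray(vaos[i]);
        glDrawElements(GL_TRIANGLES, counts[i], GL_UNSIGNED_SHORT, nullptr);
    }
}

// Batched: the same geometry pre-merged into one buffer on the CPU and
// submitted with a single call, so the per-call overhead is paid once.
void drawMergedBatch(GLuint mergedVao, GLsizei totalIndexCount)
{
    glBindVertexArray(mergedVao);
    glDrawElements(GL_TRIANGLES, totalIndexCount, GL_UNSIGNED_SHORT, nullptr);
}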
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
Short answer: the driver buffers some or all of the actual work until you call Draw. This will show up as a relatively predictable amount of time spent in the draw call, depending on how much state has changed.
This is done for a few reasons:
to avoid doing unnecessary work: if you (unnecessarily) set the same state multiple times before drawing, the driver can avoid redoing the expensive work each time this occurs. This actually becomes a fairly common occurrence in a large codebase, say a production game engine.
to be able to reconcile what internally are interdependent states instead of processing them immediately with incomplete information
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.
