why are draw calls expensive? - performance

assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card. there's a few bytes to identify the data, and presumably a 4x4 matrix, and some assorted other parameters.
so where is all of the overhead coming from? do the operations require a handshake of some sort with the gpu?
why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex id and transformation matrices? (the second option looks like there should be less data sent, unless the models are smaller than a 4x4 matrix)

First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs with making draw calls, it requires setting up a bunch of state (which set of vertices to use, what shader to use and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only apply if each call submits too little data, since this will cause you to be CPU-bound, and stop you from utilizing the hardware fully.
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!, it's fairly old but covers exactly this topic.

Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.

Short answer: The driver buffers some or all of the actual the work until you call draw. This will show up as a relatively predictable amount of time spent in the draw call, depending how much state has changed.
This is done for a few reasons:
to avoid doing unnecessary work: If you (unnecessarily) set the same state multiple times before drawing it can avoid doing expensive work each time this occurs. This actually becomes a fairly common occurrence in a large codebase, say a production game engine.
to be able to reconcile what internally are interdependent states instead of processing them immediately with incomplete information
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.

Related

CPU and GPU in 3D game: who's doing what?

I'm confused regarding the boundary between application (CPU side) vs. the GPU side. Can someone help me understand what the application is generally responsible for in a game?
My understanding is that the application submits frames for the GPU to render, a process that involves vertex shader, rasterization, and pixel shader (in the most basic rendering form). This leads me to believe that the GPU has no concept of what occurs from frame to frame.
Does this mean the application is keeping track of where all the objects are in world space? And if the user moves a character (for example), does the application determine the new location and therefore submit a new transform to the GPU?
This is confusing especially because I read that the vertex shader can be used for things like morphing, which is basically animating models over time based on two static poses.
I haven't kept in touch with the latest state-of-the-art engines, but last time I checked, almost all of the game state is typically stored on CPU and data is much more often written to from CPU to GPU, not read back nearly as often.
That can even include redundant data stored on the CPU (and not talking about what drivers do). For example, an engine might store triangle meshes in both CPU and GPU, keeping the mesh in CPU as well to do things like frustum culling, collision detection, and picking. One of the reasons for this redundancy is because it'd be too difficult, if not impossible in some cases, to write the bulk of the game logic in GPU code. For example, you might be able to accelerate some parts of collision detection on the GPU side but you can't make your physics system written entirely in GPU push events from the physics systems for the audio system to play sounds on collision events (the GPU can't communicate with audio hardware). The other is that the GPU is generally more limited in terms of memory, e.g., so the CPU might store all the images for a game map (I'm including the hard drive as "CPU memory") in addition to storing some of the active texture data redundantly on the GPU (and the textures might use a lower resolution depending on the circumstance).
The GPU is still a very specialized piece of hardware. You can't even write Pong on the GPU, after all, since it can't directly read user input from, say, keyboard, mouse, gamepad, or play audio, or load files from a hard drive, or anything like that. The CPU is still like the "master brain" that coordinates everything.
As for things like tweening and skinning, that's often computed on the GPU but that's not like "state management". The CPU might still store the matrices for each bone in a bone hierarchy and then just ship those matrices and undeformed vertex positions to the GPU and let it compute stuff on the fly. In that scenario it's not game state being stored/managed on the GPU so much as letting the GPU compute data on the fly on a per-frame basis, which it can do super fast in these scenarios, that doesn't even need to be persistently stored in the first place. The CPU doesn't even read back the resulting data in those cases.
Typically the GPU isn't used that much for managing state. It's more often used for computing things on the fly really fast where it's suitable for doing that. If there's state stored there, it's often temporary state that can be discarded and regenerated because the CPU already has sufficient data to do so. An exception would be some GPGPU software I've seen where they actually stored some application state exclusively on the GPU with no copy whatsoever on the CPU with the CPU doing more reads from the GPU than writes to it, but I don't think games are doing that quite as much.
So for the most part, yes, typically the GPU is rather oblivious of the game world and state. The CPU just uses it store some discardable data temporarily here and there, like texture data and VBOs from meshes and images it already has copies of, and uses the GPU to compute and output a lot of discardable data on the fly really fast. It's not used that frequently to store and output persistent data.
If I try to come up with a crude analogy, it's like business managers at a pizza restaurant would persistently store customer records like their addresses. They might temporarily give an address to the pizza delivery guy with his fast motorcycle to deliver a pizza to the customer, but they aren't exclusively going to leave the pizza deliver guys to keep track of every customer's address, since that would lead to too much back-and-forth communication (plus those pizza delivery guys might not be able to remember every single customer address they deliver pizzas to while the managers have like computers with databases to store a boatload of customer data). It's mostly one-way communication from business manager->pizza delivery guy. So it's like, "Hey pizza guy with fast motorcycle, go deliver a pizza to this address", and similar thing for CPU to GPU: "Hey GPU, go calculate this for me really fast and here's the data you need to do it."

Rendering in DirectX 11

When frame starts, I do my logical update and render after that.
In my render code I do usual stuff. I set few states, buffors, textures, and end by calling Draw.
m_deviceContext->Draw(
nbVertices,
0);
At frame end I call present to show rendered frame.
// Present the back buffer to the screen since rendering is complete.
if(m_vsync_enabled)
{
// Lock to screen refresh rate.
m_swapChain->Present(1, 0);
}
else
{
// Present as fast as possible.
m_swapChain->Present(0, 0);
}
Usual stuff. Now, when I call Draw, according to MSDN
Draw submits work to the rendering pipeline.
Does it mean that data is send to GPU and main thread (the one called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only Present function should make main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one. Other's include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that stuff like PSSetShader. IASetPrimitiveTopology, etc. doesn't really do anything until you call Draw.
When you call Present that is taken as an implicit indicator of 'end of frame' but your program can often continue on with setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch-up--in real-time rendering you usually don't want the latency between input and render to be really high.
The fact is, however, that GPU/CPU synchronization is complicated and the Direct3D runtime is also batcning up requests to minimize kernel-call overhead so the actual work could be happing after many Draws are submitted to the command-queue. This old article gives you the flavor of how this works. On modern GPUs, you can also have various memory operations for paging in memory, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12 but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer is much more directly building up command-queues, triggering work on various pixel and compute GPU engines, and doing all the messy stuff that is handled a little more abstracted and automatically by Direct3 11's runtime. Even still, ultimately the video driver is the one actually talking to the hardware so they can do other kinds of optimizations as well.
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by HLSL complier) and runtime shader blob optimization (by driver)
Copying resources to the GPU (i.e. loading texture data from the CPU memory) requires bus bandwidth that is limited in supply: Prefer to keep textures, VB, and IB data in Static buffers you reuse.
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a backchannel that is slower than going to the GPU: try to avoid the need for readback from the GPU
Submitting larger chunks of geometry per Draw call helps to amortize overhead (i.e. calling draw once for 10,000 triangles with the same state/shader is much faster than calling draw 10 times for a 1000 triangles each with changing state/shaders between).

Multiple contexts per application vs multiple applications per context

I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements. Currently, applications usually have their own context, meaning whatever data might be identical across different applications, it will be duplicated in GPU memory and the more frequent resource management calls only decrease the count of usable render calls. From what I understand, the OpenGL execution engine/server itself is sequential/single threaded in design. So technically, everything that might be reused across applications, and especially heavy stuff like bitmap or geometry caches for text and UI, is just clogging the server with unnecessary transfers and memory usage.
Are there any downsides to having a scenegraph shared across multiple applications? Naturally, assuming the correct handling of clients which accidentally freeze.
I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements.
That depends on the task at hand. A small detour: Take a webbrowser for example, where JavaScript performs manipulations on the DOM; CSS transform and SVG elements define graphical elements. Each JavaScript called in response to an event may run as a separate thread/lighweight process. In a matter of sense the webbrowser is a rendering engine (heck they're internally even called rendering engines) for a whole bunch of applications.
And for that it's a good idea.
And in general display servers are a very good thing. Just have a look at X11, which has an incredible track record. These days Wayland is all the hype, and a lot of people drank the Kool-Aid, but you actually want the abstraction of a display server. However not for the reasons you thought. The main reason to have a display server is to avoid redundant code (not redundant data) and to have only a single entity to deal with the dirty details (color spaces, device physical properties) and provide optimized higher order drawing primitives.
But in regard with the direct use of OpenGL none of those considerations matter:
Currently, applications usually have their own context, meaning whatever data might be identical across different applications,
So? Memory is cheap. And you don't gain performance by coalescing duplicate data, because the only thing that matters for performance is the memory bandwidth required to process this data. But that bandwidth doesn't change because it only depends on the internal structure of the data, which however is unchanged by coalescing.
In fact deduplication creates significant overhead, since when one application made changes, that are not to affect the other application a copy-on-write operation has to be invoked which is not for free, usually means a full copy, which however means that while making the whole copy the memory bandwidth is consumed.
However for a small, selected change in the data of one application, with each application having its own copy the memory bus is blocked for much shorter time.
it will be duplicated in GPU memory and the more frequent resource management calls only decrease the count of usable render calls.
Resource management and rendering normally do not interfere with each other. While the GPU is busy turning scalar values into points, lines and triangles, the driver on the CPU can do the housekeeping. In fact a lot of performance is gained by keeping making the CPU do non-rendering related work while the GPU is busy rendering.
From what I understand, the OpenGL execution engine/server itself is sequential/single threaded in design
Where did you read that? There's no such constraint/requirement on this in the OpenGL specifications and real OpenGL implementations (=drivers) are free to parallelize as much as they want.
just clogging the server with unnecessary transfers and memory usage.
Transfer happens only once, when the data gets loaded. Memory bandwidth consumption is unchanged by deduplication. And memory is so cheap these days, that data deduplication simply isn't worth the effort.
Are there any downsides to having a scenegraph shared across multiple applications? Naturally, assuming the correct handling of clients which accidentally freeze.
I think you completely misunderstand the nature of OpenGL. OpenGL is not a scene graph. There's no scene, there are mo models in OpenGL. Each applications has its own layout of data and eventually this data gets passed into OpenGL to draw pixels onto the screen.
To OpenGL however there are just drawing commands to turn arrays of scalar values into points, lines and triangles on the screen. There's nothing more to it.

How to temporarily disable OpenGL command queueing, for more accurate profiling results?

In Mac OS X's OpenGL Profiler app, I can get statistics regarding how long each GL function call takes. However, the results show that a ton of time is spent in flush commands (glFlush, glFlushRenderAPPLE, CGLFlushDrawable) and in glDrawElements, and every other GL function call's time is negligibly small.
I assume this is because OpenGL is enqueueing the commands I submit, and waiting until flushing or drawing to actually execute the commands.
I guess I could do something like this:
glFlush();
startTiming();
glDoSomething();
glFlush();
stopTimingAndRecordDelta();
...and insert that pattern around every GL function call my app makes, but that would be tedious since there are thousands of GL function calls throughout my app, and I'd have to tabulate the results manually (instead of using the already-existent OpenGL Profiler tool).
So, is there a way to disable all OpenGL command queueing, so I can get more accurate profiling results?
So, is there a way to disable all OpenGL command queueing, ...
No, there isn't an OpenGL function that does that.
..., so I can get more accurate profiling results?
You can get more accurate information than you are currently, but you'll never get really precise answers (but you can probably get what you need). While the results of OpenGL rendering are the "same" — OpenGL's not guaranteed to be pixel-accurate across implementations — they're supposed to be very close. However, how the pixels are generated can vary drastically. In particular, tiled-reneders (in mobile and embedded devices) usually don't render pixels during a draw call, but rather queue up the geometry, and generate the pixels at buffer swap.
That said, for profiling OpenGL, you want to use glFinish, instead of glFlush. glFinish will force all pending OpenGL calls to complete and return; glFlush merely requests that commands be sent to the OpenGL "at some time in the future", so it's not deterministic. Be sure to remove your glFinish in your "production" code, since it will really slow down your application. From your example, if you replace the flushes with finishes in your example, you'll get more interesting information.
You are using OpenGL 3, and in particular discussing OS X. Mavericks (10.9) supports Timer Queries, which you can use to time a single GL operation or an entire sequence of operations at the pipeline level. That is, how long they take to execute when GL actually gets around to performing them, rather than timing how long a particular API call takes to return (which is often meaningless). You can only have a single timer query in the pipeline at a given time unfortunately, so you may have to structure your software cleverly to make best use of them if you want command-level granularity.
I use them in my own work to time individual stages of the graphics engine. Things like how long it takes to update shadow maps, build the G-Buffers, perform deferred / forward lighting, individual HDR post-processing effects, etc. It really helps identify bottlenecks if you structure the timer queries this way instead of focusing on individual commands.
For instance on some filtrate limited hardware shadow map generation is the biggest bottleneck, on other shader limited hardware, lighting is. You can even use the results to determine the optimal shadow map resolution or lighting quality to meet a target framerate for a particular host without requiring the user to set these parameters manually. If you simply timed how long the individual operations took you would never get the bigger picture, but if you time entire sequences of commands that actually do some major part of your rendering you get neatly packed information that can be a lot more useful than even the output from profilers.

Drawing triangles with CUDA

I'm writing my own graphics library (yep, its homework:) and use cuda to do all rendering and calculations fast.
I have problem with drawing filled triangles. I wrote it such a way that one process draw one triangle. It works pretty fine when there are a lot of small triangles on the scene, but it breaks performance totally when triangles are big.
My idea is to do two passes. In first calculate only tab with information about scanlines (draw from here to there). This would be triangle per process calculation like in current algorithm. And in second pass really draw the scanlines with more than one process per triangle.
But will it be fast enough? Maybe there is some better solution?
You can check this blog: A Software Rendering Pipeline in CUDA. I don't think that's the optimal way to do it, but at least the author shares some useful sources.
Second, read this paper: A Programmable, Parallel Rendering Architecture. I think it's one of the most recent paper and it's also CUDA based.
If I had to do this, I would go with a Data-Parallel Rasterization Pipeline like in Larrabee (which is TBR) or even REYES and adapt it to CUDA:
http://www.ddj.com/architect/217200602
http://home.comcast.net/~tom_forsyth/larrabee/Standford%20Forsyth%20Larrabee%202010.zip (see the second part of the presentation)
http://graphics.stanford.edu/papers/mprast/
I suspect that you have some misconceptions about CUDA and how to use it, especially since you refer to a "process" when, in CUDA terminology, there is no such thing.
For most CUDA applications, there are two important things to getting good performance: optimizing memory access and making sure each 'active' CUDA thread in a warp performs the same operation at the same time as otehr active threads in the warp. Both of these sound like they are important for your application.
To optimize your memory access, you want to make sure that your reads from global memory and your writes to global memory are coalesced. You can read more about this in the CUDA programming guide, but it essentially means, adjacent threads in a half warp must read from or write to adjacent memory locations. Also, each thread should read or write 4, 8 or 16 bytes at a time.
If your memory access pattern is random, then you might need to consider using texture memory. When you need to refer to memory that has been read by other threads in a block, then you should make use of shared memory.
In your case, I'm not sure what your input data is, but you should at least make sure that your writes are coalesced. You will probably have to invest some non-trivial amount of effort to get your reads to work efficiently.
For the second part, I would recommend that each CUDA thread process one pixel in your output image. With this strategy, you should watch out for loops in your kernels that will execute longer or shorter depending on the per-thread data. Each thread in your warps should perform the same number of steps in the same order. The only exception to this is that there is no real performance penalty for having some threads in a warp perform no operation while the remaining threads perform the same operation together.
Thus, I would recommend having each thread check if its pixel is inside a given triangle. If not, it should do nothing. If it is, it should compute the output color for that pixel.
Also, I'd strongly recommend reading more about CUDA as it seems like you are jumping into the deep end without having a good understanding of some of the basic fundamentals.
Not to be rude, but isn't this what graphics cards are designed to do anyway? Seems like using the standard OpenGL and Direct3D APIs would make more sense.
Why not use the APIs to do your basic rendering, rather than CUDA, which is much lower-level? Then, if you wish to do additional operations that are not supported, you can use CUDA to apply them on top. Or maybe implement them as shaders.

Resources