Process THREE.Texture in webworkers - three.js

I would like to process THREE.Texture in web workers to help with post-processing effects such as bloom. For the bloom effect, for instance, I would first draw the scene to a THREE.Texture object and then hand the blur step off to the web worker. What would be the most efficient way to pass the THREE.Texture data to the worker and to create a new THREE.Texture from the data the worker sends back? Since I would ideally do this 60 times per second, I need a fast and memory-friendly way to do it (memory friendly: not creating new objects in a loop but re-using existing ones).
I'm aware that canvas2DContext.getImageData may be helpful, but that is probably not the best way, since drawing to a canvas 60 times per second would slow things down.
Thanks!
PS: I should specify that in this approach I don't intend to wait for the worker to finish processing the texture before rendering the final result. Since most of the objects are static, I don't think that would be a big deal anyway. I want to test it to see how it goes for the dynamic objects, though.

Passing a GPU-based texture to a web worker would not speed anything up; in fact it would be significantly slower.
It's extremely slow to transfer memory from the GPU to the CPU (and from the CPU to the GPU as well) relative to doing everything on the GPU. The only way to pass the contents of a texture to a worker is to ask WebGL to copy it from the GPU to the CPU (using gl.readPixels, or whatever three.js's wrapper around gl.readPixels is) and then transfer the result to the worker. In the worker, all you could do is a slow CPU-based blur (slower than it would have been on the GPU), and then you'd have to transfer the result back to the main thread only to upload it again via gl.texImage2D (or by telling three.js to do it for you), which is also a slow operation copying the data from the CPU back to the GPU.
The fast way to apply a blur is to do it on the GPU.
Further, there is no way to share WebGL resources between the main thread and a worker, nor is there likely to be anytime soon. Even if you could share the resource and then ask the GPU to do the blur from the worker, that would save no time either: for the most part GPUs don't run different operations in parallel (they aren't generically multi-process the way CPUs are), so all you'd end up doing is asking the GPU to do the same amount of work.
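To make that round trip concrete, here is a rough sketch in desktop OpenGL (WebGL, and therefore three.js, ultimately wraps the same calls); the cpuBlur helper is hypothetical and only marks where the worker-side processing would go:

#include <GL/gl.h>
#include <vector>

void blurViaCpuRoundTrip(int width, int height)
{
    std::vector<unsigned char> pixels(width * height * 4);

    // 1. GPU -> CPU transfer: the GPU must finish all pending work on this
    //    framebuffer before the copy can even start, so this call stalls.
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());

    // 2. CPU-side blur (or hand the buffer to another thread/worker).
    //    This is far slower than the same blur written as a fragment shader.
    // cpuBlur(pixels, width, height);   // hypothetical helper

    // 3. CPU -> GPU transfer: re-upload the result as a texture.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
}

Both marked transfers consume bus bandwidth every frame, which is exactly what a GPU-side blur pass avoids.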

Related

CPU and GPU in 3D game: who's doing what?

I'm confused about the boundary between the application (CPU side) and the GPU side. Can someone help me understand what the application is generally responsible for in a game?
My understanding is that the application submits frames for the GPU to render, a process that involves vertex shader, rasterization, and pixel shader (in the most basic rendering form). This leads me to believe that the GPU has no concept of what occurs from frame to frame.
Does this mean the application is keeping track of where all the objects are in world space? And if the user moves a character (for example), does the application determine the new location and therefore submit a new transform to the GPU?
This is especially confusing because I read that the vertex shader can be used for things like morphing, which is basically animating models over time based on two static poses.
I haven't kept in touch with the latest state-of-the-art engines, but last time I checked, almost all of the game state is typically stored on the CPU, and data is written from the CPU to the GPU far more often than it is read back.
That can even include redundant data stored on the CPU (and I'm not talking about what drivers do). For example, an engine might store triangle meshes on both the CPU and the GPU, keeping the CPU copy to do things like frustum culling, collision detection, and picking. One reason for this redundancy is that it would be too difficult, if not impossible in some cases, to write the bulk of the game logic in GPU code. For example, you might be able to accelerate some parts of collision detection on the GPU, but you can't write your physics system entirely on the GPU and have it push collision events to the audio system to play sounds (the GPU can't communicate with audio hardware). The other reason is that the GPU is generally more limited in terms of memory, so the CPU might store all the images for a game map (I'm including the hard drive as "CPU memory") in addition to storing some of the active texture data redundantly on the GPU (possibly at a lower resolution, depending on the circumstance).
The GPU is still a very specialized piece of hardware. You can't even write Pong on the GPU alone, after all, since it can't directly read user input from, say, a keyboard, mouse, or gamepad, nor play audio, load files from a hard drive, or anything like that. The CPU is still like the "master brain" that coordinates everything.
As for things like tweening and skinning, those are often computed on the GPU, but that's not really "state management". The CPU might still store the matrices for each bone in a bone hierarchy and just ship those matrices and the undeformed vertex positions to the GPU, letting it compute the deformed result on the fly. In that scenario it's not game state being stored/managed on the GPU so much as the GPU computing per-frame data on the fly (which it can do extremely fast in these scenarios) that doesn't even need to be persistently stored in the first place. The CPU doesn't even read back the resulting data in those cases.
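As a hedged illustration of that split (Direct3D 11 here, since it comes up later on this page; the structure layout and function names are hypothetical), the CPU keeps the bone hierarchy as the authoritative state and only ships a small matrix palette each frame:

#include <d3d11.h>
#include <DirectXMath.h>

struct BonePalette { DirectX::XMFLOAT4X4 bones[64]; };   // hypothetical layout

void drawSkinnedMesh(ID3D11DeviceContext* ctx,
                     ID3D11Buffer* boneConstantBuffer,    // created once, reused every frame
                     const BonePalette& palette,
                     UINT indexCount)
{
    // Small per-frame CPU -> GPU upload; the game state itself stays on the CPU.
    ctx->UpdateSubresource(boneConstantBuffer, 0, nullptr, &palette, 0, 0);
    ctx->VSSetConstantBuffers(0, 1, &boneConstantBuffer);

    // The vertex shader skins the undeformed vertices on the fly; the result is
    // rasterized and discarded, never read back to the CPU.
    ctx->DrawIndexed(indexCount, 0, 0);
}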
Typically the GPU isn't used that much for managing state. It's more often used for computing things on the fly really fast where it's suitable for doing so. If there's state stored there, it's often temporary state that can be discarded and regenerated because the CPU already has sufficient data to do so. An exception would be some GPGPU software I've seen that actually stored some application state exclusively on the GPU, with no copy whatsoever on the CPU and with the CPU doing more reads from the GPU than writes to it, but I don't think games are doing that quite as much.
So for the most part, yes, the GPU is typically rather oblivious of the game world and its state. The CPU just uses it to store some discardable data temporarily here and there, like texture data and VBOs built from meshes and images it already has copies of, and uses the GPU to compute and output a lot of discardable data on the fly really fast. It's not used that frequently to store and output persistent data.
If I try to come up with a crude analogy, it's like the business managers at a pizza restaurant persistently storing customer records like their addresses. They might temporarily give an address to the pizza delivery guy with his fast motorcycle to deliver a pizza to the customer, but they aren't going to leave it exclusively to the delivery guys to keep track of every customer's address, since that would lead to too much back-and-forth communication (plus those delivery guys might not be able to remember every single address they deliver pizzas to, while the managers have computers with databases to store a boatload of customer data). It's mostly one-way communication from business manager to delivery guy. So it's like, "Hey pizza guy with the fast motorcycle, go deliver a pizza to this address", and it's a similar thing from CPU to GPU: "Hey GPU, go calculate this for me really fast, and here's the data you need to do it."

Rendering in DirectX 11

When a frame starts, I do my logical update and render after that.
In my render code I do the usual stuff. I set a few states, buffers, and textures, and end by calling Draw.
m_deviceContext->Draw(nbVertices, 0);
At the end of the frame I call Present to show the rendered frame.
// Present the back buffer to the screen since rendering is complete.
if(m_vsync_enabled)
{
    // Lock to screen refresh rate.
    m_swapChain->Present(1, 0);
}
else
{
    // Present as fast as possible.
    m_swapChain->Present(0, 0);
}
Usual stuff. Now, when I call Draw, according to MSDN
Draw submits work to the rendering pipeline.
Does this mean that the data is sent to the GPU and the main thread (the one that called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only the Present function should make the main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one. Others include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that calls like PSSetShader, IASetPrimitiveTopology, etc. don't really do anything until you call Draw.
When you call Present, that is taken as an implicit indicator of 'end of frame', but your program can often continue setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch up, although in real-time rendering you usually don't want the latency between input and render to be really high.
The fact is, however, that GPU/CPU synchronization is complicated, and the Direct3D runtime is also batching up requests to minimize kernel-call overhead, so the actual work could be happening after many Draws have been submitted to the command queue. This old article gives you a flavor of how this works. On modern GPUs you can also have various memory operations for paging in memory, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12, but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer is much more directly building up command queues, triggering work on the various pixel and compute GPU engines, and doing all the messy stuff that is handled a little more abstractly and automatically by Direct3D 11's runtime. Even so, ultimately the video driver is the one actually talking to the hardware, so it can do other kinds of optimizations as well.
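One hedged way to see this asynchrony for yourself is an event query: Draw returns almost immediately, while the query only reports completion once the GPU has actually reached that point in the command stream (sketch only, error handling trimmed):

#include <d3d11.h>

bool hasGpuFinished(ID3D11Device* device, ID3D11DeviceContext* ctx)
{
    D3D11_QUERY_DESC desc = {};
    desc.Query = D3D11_QUERY_EVENT;

    ID3D11Query* query = nullptr;
    if (FAILED(device->CreateQuery(&desc, &query)))
        return false;

    // ... issue Draw calls here ...

    ctx->End(query);   // marks a point in the command stream

    // S_FALSE from GetData means the GPU has not reached the marker yet,
    // i.e. Draw returned long before the submitted work completed.
    BOOL done = FALSE;
    HRESULT hr = ctx->GetData(query, &done, sizeof(done), 0);
    query->Release();
    return hr == S_OK && done;
}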
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by the HLSL compiler) and runtime shader blob optimization (by the driver).
Copying resources to the GPU (i.e. loading texture data from CPU memory) requires bus bandwidth, which is in limited supply: prefer to keep texture, VB, and IB data in static buffers you reuse (see the sketch after this list).
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a backchannel that is slower than going to the GPU: try to avoid the need for readback from the GPU.
Submitting larger chunks of geometry per Draw call helps to amortize overhead (i.e. calling Draw once for 10,000 triangles with the same state/shader is much faster than calling Draw 10 times for 1,000 triangles each, with changing state/shaders in between).
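As a sketch of the "static buffers you reuse" rule above (assuming the vertex data never changes after creation), create the buffer once with immutable usage and keep reusing it every frame instead of re-uploading:

#include <d3d11.h>

ID3D11Buffer* createStaticVertexBuffer(ID3D11Device* device,
                                        const void* vertices, UINT byteSize)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = byteSize;
    desc.Usage     = D3D11_USAGE_IMMUTABLE;       // GPU-only, never rewritten
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

    D3D11_SUBRESOURCE_DATA initData = {};
    initData.pSysMem = vertices;                  // one-time CPU -> GPU copy

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, &initData, &buffer);
    return buffer;                                // keep this and bind it each frame
}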

Multiple contexts per application vs multiple applications per context

I was wondering whether it is a good idea to create a "system"-wide rendering server that is responsible for the rendering of all application elements. Currently, applications usually have their own context, meaning that whatever data might be identical across different applications will be duplicated in GPU memory, and the more frequent resource-management calls only decrease the number of usable render calls. From what I understand, the OpenGL execution engine/server itself is sequential/single-threaded by design. So technically, everything that might be reused across applications, especially heavy stuff like bitmap or geometry caches for text and UI, is just clogging the server with unnecessary transfers and memory usage.
Are there any downsides to having a scenegraph shared across multiple applications? Naturally, assuming the correct handling of clients which accidentally freeze.
I was wondering whether it is a good idea to create a "system" wide rendering server that is responsible for the rendering of all application elements.
That depends on the task at hand. A small detour: take a web browser, for example, where JavaScript performs manipulations on the DOM, and CSS transforms and SVG elements define graphical elements. Each piece of JavaScript called in response to an event may run as a separate thread/lightweight process. In a sense, the web browser is a rendering engine (heck, they're even called rendering engines internally) for a whole bunch of applications.
And for that it's a good idea.
And in general, display servers are a very good thing. Just have a look at X11, which has an incredible track record. These days Wayland is all the hype, and a lot of people drank the Kool-Aid, but you actually do want the abstraction of a display server; however, not for the reasons you think. The main reason to have a display server is to avoid redundant code (not redundant data) and to have a single entity that deals with the dirty details (color spaces, device physical properties) and provides optimized higher-order drawing primitives.
But in regard with the direct use of OpenGL none of those considerations matter:
Currently, applications usually have their own context, meaning whatever data might be identical across different applications,
So? Memory is cheap. And you don't gain performance by coalescing duplicate data, because the only thing that matters for performance is the memory bandwidth required to process this data, and that bandwidth doesn't change: it depends only on the internal structure of the data, which is unchanged by coalescing.
In fact, deduplication creates significant overhead: when one application makes changes that are not supposed to affect the other application, a copy-on-write operation has to be invoked, which is not free and usually means a full copy, and while the whole copy is being made the memory bandwidth is consumed.
With each application having its own copy, however, a small, selected change in one application's data blocks the memory bus for a much shorter time.
it will be duplicated in GPU memory and the more frequent resource management calls only decrease the count of usable render calls.
Resource management and rendering normally do not interfere with each other. While the GPU is busy turning scalar values into points, lines, and triangles, the driver on the CPU can do the housekeeping. In fact, a lot of performance is gained by having the CPU do non-rendering-related work while the GPU is busy rendering.
From what I understand, the OpenGL execution engine/server itself is sequential/single threaded in design
Where did you read that? There's no such constraint/requirement in the OpenGL specification, and real OpenGL implementations (i.e. drivers) are free to parallelize as much as they want.
just clogging the server with unnecessary transfers and memory usage.
Transfer happens only once, when the data gets loaded. Memory bandwidth consumption is unchanged by deduplication. And memory is so cheap these days that data deduplication simply isn't worth the effort.
Are there any downsides to having a scenegraph shared across multiple applications? Naturally, assuming the correct handling of clients which accidentally freeze.
I think you completely misunderstand the nature of OpenGL. OpenGL is not a scene graph. There's no scene and there are no models in OpenGL. Each application has its own layout of data, and eventually this data gets passed to OpenGL to draw pixels onto the screen.
To OpenGL, however, these are just drawing commands that turn arrays of scalar values into points, lines, and triangles on the screen. There's nothing more to it.
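To underline that point, this is roughly all OpenGL ever sees (a GL 3.3 core sketch; context creation, shader compilation, and the extension loader are omitted and assumed to be in place): a buffer of scalars and a command to turn them into triangles.

#include <GL/gl.h>   // in practice a loader such as GLAD/GLEW provides the 3.x entry points

void drawTriangles(GLuint vao, GLuint program, GLsizei vertexCount)
{
    glUseProgram(program);        // which shaders to run
    glBindVertexArray(vao);       // where the raw vertex scalars live
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);   // "turn them into triangles"
}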

Thread effectiveness for graphical applications

Let's say I'm creating lighting for a scene, using my own shaders. It seems like a good example of something that might be divided between many threads, for example by dividing the scene into smaller scenes and rendering them in separate threads.
Should I divide it into threads manually, or is the graphics library going to somehow divide such operations automatically? Or is it library dependent (I'm using libgdx, which appears to be using OpenGL)? Or maybe there is another reason why I should leave it all on one thread?
If I should take care of dividing the workload between threads manually, how many threads should I use? Is the number of threads in such a situation dependent on the graphics card or on the processor?
OpenGL does not support multi-threaded rendering, since an OpenGL context can only be current on one thread at a time.
What you could do to potentially gain some performance is separate your update logic and your rendering logic into separate threads. However, you cannot leverage multiple threads for the OpenGL rendering itself.
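A minimal sketch of that split, assuming a single mutex-guarded snapshot as the hand-off (real engines usually double-buffer): game logic runs on one thread, while every OpenGL call stays on the thread that owns the context.

#include <atomic>
#include <mutex>
#include <thread>

struct FrameState { /* positions, light parameters, ... */ };

std::mutex        g_stateMutex;
FrameState        g_sharedState;          // written by update, read by render
std::atomic<bool> g_running{true};

void updateThread()
{
    while (g_running) {
        FrameState next{};                // run game logic, fill in `next`
        std::lock_guard<std::mutex> lock(g_stateMutex);
        g_sharedState = next;             // publish the latest snapshot
    }
}

void renderThread()
{
    // The GL context is created and made current on this thread only.
    while (g_running) {
        FrameState snapshot;
        {
            std::lock_guard<std::mutex> lock(g_stateMutex);
            snapshot = g_sharedState;     // cheap copy; no GL call holds the lock
        }
        // ... issue all OpenGL draw calls from `snapshot` here ...
    }
}

int main()
{
    std::thread update(updateThread);
    renderThread();                       // render on the thread owning the GL context
    update.join();
}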

Why are draw calls expensive?

Assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card. There are a few bytes to identify the data, presumably a 4x4 matrix, and some assorted other parameters.
So where is all of the overhead coming from? Do the operations require a handshake of some sort with the GPU?
Why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex IDs and transformation matrices? (The second option looks like it should send less data, unless the models are smaller than a 4x4 matrix.)
First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs with making draw calls, it requires setting up a bunch of state (which set of vertices to use, what shader to use and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only apply if each call submits too little data, since this will cause you to be CPU-bound, and stop you from utilizing the hardware fully.
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!, it's fairly old but covers exactly this topic.
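To put the batching idea into code, here is a hedged Direct3D 11 sketch (the same pattern exists in OpenGL as glDrawArraysInstanced): the first function pays the per-call overhead once per small model, while the second submits the same triangles in a single instanced call, with the per-instance transforms fetched by the GPU from a buffer set up beforehand.

#include <d3d11.h>

void drawManySmallCalls(ID3D11DeviceContext* ctx, UINT indexCount, UINT modelCount)
{
    for (UINT i = 0; i < modelCount; ++i) {
        // per-model state changes (constant buffers, etc.) would go here
        ctx->DrawIndexed(indexCount, 0, 0);   // modelCount trips through the driver
    }
}

void drawOneInstancedCall(ID3D11DeviceContext* ctx, UINT indexCount, UINT modelCount)
{
    ctx->DrawIndexedInstanced(indexCount, modelCount, 0, 0, 0);   // one trip
}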
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
Short answer: the driver buffers some or all of the actual work until you call Draw. This will show up as a relatively predictable amount of time spent in the draw call, depending on how much state has changed.
This is done for a few reasons:
To avoid doing unnecessary work: if you (unnecessarily) set the same state multiple times before drawing, the driver can avoid doing the expensive work each time this occurs. This actually becomes a fairly common occurrence in a large codebase, say a production game engine.
To be able to reconcile what are internally interdependent states, instead of processing them immediately with incomplete information.
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.
