Below is a UML sequence diagram showing processing time, based on my understanding of the game loop in the libGDX library. I assume the architecture is much the same in every other game library, but I am not sure whether I have understood it correctly.
In theory the CPU and GPU work in parallel, but because the CPU waits until the GPU has finished before the buffer swap, the whole thing becomes a serial process.
How can I make my game loop work in parallel, or is my understanding wrong?
Now imagine we want parallelisation, the GPU is slower than the CPU, and the CPU continues with the next frame while the GPU is still rendering. A second thread waits for the GPU to finish; once the GPU is done, the next image is calculated. But where do the OpenGL state changes and the draw commands go now? The GPU is still busy. This leads me to the conclusion that I am missing something.
EDIT:
Suggested by Ross Vander:
One problem I see with the second diagram, which may be where you're going wrong, is that the GPU seems to return to CPU thread 2 even though it was CPU thread 1 that sent data to the GPU and started blocking on it. Swapping two references for the front and back buffer doesn't change which thread is blocking on the GPU.
I think the order of events should be more like this: CPU thread 1 sends data from the front buffer to the GPU to render. Simultaneously, thread 2 is writing to the back buffer. When the GPU finishes, thread 1 is free to swap the front and back buffers (assuming thread 2 is done) and then send the data to the GPU. Thread 2 writes to the back buffer again while the GPU is working.
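As a rough illustration of that ordering, here is a minimal C++ sketch of the two-thread, double-buffered pattern described above. It is not libGDX or OpenGL code: Buffer, simulate() and submitToGpuAndWait() are made-up placeholders, and the point is only to show that a single thread owns the GPU while the other thread fills the back buffer.

// Illustrative sketch only. "Buffer", "simulate()" and "submitToGpuAndWait()"
// are placeholders, not libGDX or OpenGL API. Thread 1 (main) is the only
// thread that ever talks to the GPU; thread 2 only fills the back buffer.
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>

struct Buffer { /* CPU-side frame data: draw commands, transforms, ... */ };

void submitToGpuAndWait(const Buffer&) { /* issue draw calls, block until done */ }
void simulate(Buffer&)                 { /* game logic writes the next frame  */ }

int main() {
    Buffer front, back;
    std::mutex m;
    std::condition_variable cv;
    bool backReady = false;
    std::atomic<bool> running{true};

    // Thread 2: produces frames into the back buffer, never touches the GPU.
    std::thread logic([&] {
        while (running) {
            simulate(back);
            { std::lock_guard<std::mutex> lk(m); backReady = true; }
            cv.notify_one();
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !backReady || !running; }); // wait for the swap
        }
    });

    // Thread 1: owns the GPU. Render the front buffer, then swap.
    for (int frame = 0; frame < 100; ++frame) {
        submitToGpuAndWait(front);              // GPU works; this thread blocks
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return backReady; }); // make sure thread 2 has finished
        std::swap(front, back);                 // the new frame becomes the front buffer
        backReady = false;
        lk.unlock();
        cv.notify_one();                        // let thread 2 write the next frame
    }
    { std::lock_guard<std::mutex> lk(m); running = false; }
    cv.notify_one();
    logic.join();
}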
Update: taken from https://github.com/libgdx/libgdx/wiki/Threading
Any graphics operations directly involving OpenGL need to be executed on the rendering thread. Doing so on a different thread results in undefined behaviour. This is due to the OpenGL context only being active on the rendering thread. Making the context current on another thread has its problems on a lot of Android devices, hence it is unsupported.
Related
I would like to process a THREE.Texture in web workers to help with post-processing effects such as bloom. For instance, for the bloom effect, I would first draw the scene to a THREE.Texture object and then handle the blur in the web worker. What would be the most efficient way to pass the THREE.Texture data to the worker and create a new THREE.Texture from the data obtained from the worker? Since I would ideally do that 60 times per second, I need a fast and memory-friendly way to do it (memory-friendly meaning not creating new objects in a loop but re-using existing ones).
I'm aware that canvas2DContext.getImageData may be helpful, but that is probably not the best way, since I'd be drawing to the canvas 60 times per second and that would slow things down.
Thanks!
PS: I should specify that in this approach, I don't intend to wait for the worker to finish processing the texture before rendering the final result. Since most of the objects are static, I don't think that would be a big deal anyway. I want to test it to see how it goes for dynamic objects, though.
Passing a GPU-based texture to a web worker would not speed anything up; in fact, it would be significantly slower.
It's extremely slow to transfer memory from the GPU to the CPU (and from the CPU to the GPU) relative to doing everything on the GPU. The only way to pass the contents of a texture to a worker is to ask WebGL to copy it from the GPU to the CPU (using gl.readPixels, or in three.js whatever its wrapper for gl.readPixels is) and then transfer the result to the worker. In the worker, all you could do is a slow CPU-based blur (slower than it would have been on the GPU); then you'd have to transfer it back to the main thread, only to upload it again via gl.texImage2D (or by telling three.js to do it for you), which is also a slow operation that copies the data from the CPU back to the GPU.
The fast way to apply a blur is to do it on the GPU.
Further, there is no way to share WebGL resources between the main thread and a worker, nor is there likely to be anytime soon. Even if you could share the resource and then, from the worker, ask the GPU to do the blur, that would save no time either: for the most part GPUs don't run different operations in parallel (they are not generically multi-process like CPUs), so all you'd end up doing is asking the GPU to do the same amount of work.
When a frame starts, I do my logic update and render after that.
In my render code I do the usual stuff: I set a few states, buffers, and textures, and end by calling Draw.
m_deviceContext->Draw(
nbVertices,
0);
At frame end I call Present to show the rendered frame.
// Present the back buffer to the screen since rendering is complete.
if(m_vsync_enabled)
{
// Lock to screen refresh rate.
m_swapChain->Present(1, 0);
}
else
{
// Present as fast as possible.
m_swapChain->Present(0, 0);
}
Usual stuff. Now, when I call Draw, according to MSDN
Draw submits work to the rendering pipeline.
Does it mean that the data is sent to the GPU and the main thread (the one that called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only the Present function should make the main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one. Others include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that calls like PSSetShader, IASetPrimitiveTopology, etc. don't really do anything until you call Draw.
When you call Present, that is taken as an implicit indicator of 'end of frame', but your program can often continue setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch up; in real-time rendering you usually don't want the latency between input and render to be really high.
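If you want to control that queuing depth yourself (for example, to trade throughput for lower input latency), one option is to cap the frame latency through DXGI. This is a hedged sketch: it assumes the device pointer is called m_device, alongside the m_swapChain in the question, and it omits error handling.

#include <dxgi.h>

// Cap how many frames the runtime will buffer before Present blocks.
// m_device is assumed to be the ID3D11Device that created the swap chain.
IDXGIDevice1* dxgiDevice = nullptr;
if (SUCCEEDED(m_device->QueryInterface(__uuidof(IDXGIDevice1),
                                       reinterpret_cast<void**>(&dxgiDevice))))
{
    dxgiDevice->SetMaximumFrameLatency(1); // at most one frame queued ahead
    dxgiDevice->Release();
}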
The fact is, however, that GPU/CPU synchronization is complicated, and the Direct3D runtime is also batching up requests to minimize kernel-call overhead, so the actual work could be happening after many Draws are submitted to the command queue. This old article gives you the flavor of how this works. On modern GPUs, you can also have various memory operations for paging in memory, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12, but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer is much more directly building up command queues, triggering work on the various pixel and compute GPU engines, and doing all the messy stuff that is handled a little more abstractly and automatically by Direct3D 11's runtime. Even then, ultimately the video driver is the one actually talking to the hardware, so it can do other kinds of optimizations as well.
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by the HLSL compiler) and runtime shader blob optimization (by the driver).
Copying resources to the GPU (i.e. loading texture data from CPU memory) requires bus bandwidth that is limited in supply: prefer to keep textures, VB, and IB data in static buffers you reuse (see the sketch below).
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a back-channel that is slower than going to the GPU: try to avoid the need for readback from the GPU.
Submitting larger chunks of geometry per Draw call helps to amortize overhead (i.e. calling Draw once for 10,000 triangles with the same state/shader is much faster than calling Draw 10 times for 1,000 triangles each, changing state/shaders between calls).
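To illustrate the 'static buffers' point, here is a hedged D3D11 sketch of a vertex buffer that is filled once at load time and reused every frame. Vertex, vertices and m_device are placeholders for whatever your engine uses, and error handling is omitted.

// Create the vertex data once, in immutable GPU memory, and reuse it every
// frame instead of re-uploading it. "Vertex", "vertices" and "m_device" are
// placeholders; check hr in real code.
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = sizeof(Vertex) * nbVertices;
desc.Usage     = D3D11_USAGE_IMMUTABLE;      // never written again after creation
desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

D3D11_SUBRESOURCE_DATA init = {};
init.pSysMem = vertices;                     // CPU-side array, copied exactly once

ID3D11Buffer* vertexBuffer = nullptr;
HRESULT hr = m_device->CreateBuffer(&desc, &init, &vertexBuffer);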
I am trying to determine approximate time delay (Win 7, Vista, XP) to switch threads when an IO operation completes.
What I (think I) know is that:
a) Thread context switches are themselves computationally very fast. (By very fast, I mean typically way under 1 ms, maybe even under 1 µs? - assuming a relatively fast, unloaded machine, etc.)
b) Round robin time slice quantums are on the order of 10-15ms.
What I can't seem to find is information about the typical latency between a (high-priority) thread becoming active/signaled - via, say, a synchronous disk write completing - and that thread actually running again.
E.g., I have read in at least one place that all inactive threads remain asleep until the ~10 ms system quantum expires and then (assuming they are ready to go) they all get reactivated almost synchronously together. But in another place I read that the delay between when a thread completes an I/O operation and when it becomes active/signaled and runs again is measured in microseconds, not milliseconds.
My context for asking is capturing from a high-speed camera and continuously streaming the data to a RAID array of SSDs; unless I can start a new write well under 1 ms after the prior one has finished (ideally under 0.1 ms on average), it will be problematic.
Any information regarding this issue would be most appreciated.
Thanks,
David
Thread context switches cost between 2,000 and 10,000 cpu cycles, so a handful of microseconds.
An I/O completion is fast when a thread is blocking on the synchronization handle that signals completion. That makes the Windows thread scheduler temporarily boost the thread priority. Which in turn makes it likely (but not guaranteed) to be chosen as the thread that gets the processor love. So that's typically microseconds, not milliseconds.
Do note that disk writes normally go through the file system cache. Which makes the WriteFile() call a simple memory-to-memory copy that doesn't block the thread. This runs at memory bus speeds, 5 gigabytes per second and up. Data is then written to the disk in a lazy fashion, the thread isn't otherwise involved or delayed by that. You'll only get slow writes when the file system cache is filled to capacity and you don't use overlapped I/O. Which is certainly a possibility if you write video streams. The amount of RAM makes a great deal of difference. And SSD controllers are not made the same. Nothing you can reason out up front, you'll have to test.
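For the streaming-write scenario in the question, overlapped I/O is the usual way to keep the capture thread from ever blocking on the disk. A minimal, hedged sketch (the file name and frame buffer are placeholders, and error handling is reduced to the bare minimum):

#include <windows.h>

int main()
{
    // Placeholder frame data standing in for a captured camera frame.
    static char frameData[1 << 20];
    const DWORD frameBytes = sizeof(frameData);

    // FILE_FLAG_OVERLAPPED lets WriteFile return before the data hits the disk.
    HANDLE file = CreateFileW(L"frame.bin", GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS,
                              FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED,
                              nullptr);

    OVERLAPPED ov = {};
    ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);

    // Kick off the write; ERROR_IO_PENDING is the normal "it is in flight" case.
    if (!WriteFile(file, frameData, frameBytes, nullptr, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
    {
        return 1; // real error handling goes here
    }

    // ... service the camera, prepare the next buffer, etc. ...

    DWORD written = 0;
    GetOverlappedResult(file, &ov, &written, TRUE); // TRUE = block until done

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return 0;
}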
There are plenty of examples in Windows of applications triggering code at fairly high and stable framerates without spiking the CPU.
WPF/Silverlight/WinRT applications can do this, for example. So can browsers and media players. How exactly do they do this, and what API calls would I make to achieve the same effect from a Win32 application?
Clock polling doesn't work, of course, because that spikes the CPU. Neither does Sleep(), because you only get around 50ms granularity at best.
They are using multimedia timers. You can find information about them on MSDN.
Only the view is invalidated (e.g. with InvalidateRect) on each multimedia timer event. Drawing happens in the WM_PAINT / OnPaint handler.
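A hedged sketch of that pattern using the winmm multimedia timer API: the periodic callback only invalidates the window, and all drawing stays in the WM_PAINT handler. g_hwnd is assumed to be your window handle, and a real program would keep the timer ID so it can call timeKillEvent and timeEndPeriod on shutdown.

#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

// The callback runs on a timer thread roughly every 16 ms; it does no drawing
// itself, it only schedules a repaint of the window.
HWND g_hwnd; // assumed to be set to your window handle elsewhere

void CALLBACK FrameTimerProc(UINT /*timerId*/, UINT /*msg*/, DWORD_PTR /*user*/,
                             DWORD_PTR /*dw1*/, DWORD_PTR /*dw2*/)
{
    InvalidateRect(g_hwnd, nullptr, FALSE);  // drawing happens later in WM_PAINT
}

void StartFrameTimer()
{
    timeBeginPeriod(1);                                     // ask for 1 ms timer resolution
    timeSetEvent(16, 1, FrameTimerProc, 0, TIME_PERIODIC);  // ~60 Hz callbacks
}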
Actually, there's nothing wrong with sleep.
You can use a combination of QueryPerformanceCounter/QueryPerformanceFrequency to obtain very accurate timings, and you can create a loop which, on average, ticks forward exactly when it's supposed to.
I have never seen a sleep miss its deadline by as much as 50 ms. However, I've seen plenty of naive timers that drift, i.e. accumulate a small delay and consequently update at noticeably irregular intervals. This is what causes uneven framerates.
If you play a very short beep on every nth frame, this is very audible.
Also, logic and rendering can be run independently of each other. The CPU might not appear to be that busy, but I bet you the GPU is hard at work.
Now, about not hogging the CPU. CPU usage is just a breakdown of CPU time spent by a process over a given sample (the thread scheduler actually tracks this). If you have a target of 30 Hz for your game, you're limited to 33 ms per frame, otherwise you'll be lagging behind (too slow a CPU or too slow code). If you can't hit this target you won't be running at 30 Hz, and if you finish in under 33 ms you can yield processor time, effectively freeing up resources.
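A minimal sketch of such a loop, assuming hypothetical Update() and Render() functions: each iteration is timed with QueryPerformanceCounter and whatever is left of the 33 ms budget is given back with Sleep. A production loop would schedule against an absolute next-tick time to avoid the slow drift mentioned above.

#include <windows.h>

void Update() { /* game logic for one tick (placeholder) */ }
void Render() { /* draw one frame (placeholder) */ }

void RunLoop()
{
    LARGE_INTEGER freq, frameStart, now;
    QueryPerformanceFrequency(&freq);
    const double targetMs = 1000.0 / 30.0;   // ~33.3 ms per frame at 30 Hz

    for (;;)
    {
        QueryPerformanceCounter(&frameStart);

        Update();
        Render();

        QueryPerformanceCounter(&now);
        double elapsedMs = (now.QuadPart - frameStart.QuadPart) * 1000.0
                           / static_cast<double>(freq.QuadPart);

        if (elapsedMs < targetMs)
            Sleep(static_cast<DWORD>(targetMs - elapsedMs)); // give the time back
        // else: budget missed, start the next frame immediately
    }
}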
This might be an interesting read for you as well.
On a side note, instead of yielding time you could effectively be doing prep work for future computations. Some games, when they are not under the heaviest of loads, actually do things such as sorting and memory defragmentation, a little bit here and there; it adds up in the end.
There are programs that are able to limit the CPU usage of processes in Windows, for example BES and ThreadMaster. I need to write my own program that does the same thing as these programs but with different configuration capabilities. Does anybody know how the CPU throttling of a process is done (in code)? I'm not talking about setting the priority of a process, but rather how to limit its CPU usage to, for example, 15% even if there are no other processes competing for CPU time.
Update: I need to be able to throttle any process that is already running and that I have no source code access to.
You probably want to run the process(es) in a job object, and set the maximum CPU usage for the job object with SetInformationJobObject, with JOBOBJECT_CPU_RATE_CONTROL_INFORMATION.
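A hedged sketch of that approach: it assumes hProcess was opened with PROCESS_SET_QUOTA | PROCESS_TERMINATE rights and that you are on Windows 8 / Server 2012 or later, which is when CPU rate control for job objects was introduced; error handling is minimal.

#include <windows.h>

// Cap an already-running process at roughly 15% CPU using a job object.
bool CapProcessCpu(HANDLE hProcess)
{
    HANDLE job = CreateJobObjectW(nullptr, nullptr);
    if (!job || !AssignProcessToJobObject(job, hProcess))
        return false;

    JOBOBJECT_CPU_RATE_CONTROL_INFORMATION rate = {};
    rate.ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE |
                        JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP;
    rate.CpuRate = 15 * 100;  // CpuRate is a percentage multiplied by 100

    return SetInformationJobObject(job, JobObjectCpuRateControlInformation,
                                   &rate, sizeof(rate)) != FALSE;
}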
Very simplified, it could work somehow like this:
Create a periodic waitable timer with some reasonable small wait time (maybe 100ms). Get a "last" value for each relevant process by calling GetProcessTimes once.
Loop forever, blocking on the timer.
Each time you wake up:
if GetProcessAffinityMask returns 0, call SetProcessAffinityMask(old_value). This means we suspended that process in our last iteration; we're now giving it a chance to run again.
else call GetProcessTimes to get the "current" value
call GetSystemTimeAsFileTime
calculate delta by subtracting last from current
cpu_usage = (deltaKernelTime + deltaUserTime) / (deltaTime)
if that's more than you want, call old_value = GetProcessAffinityMask followed by SetProcessAffinityMask(0), which will take the process offline.
This is basically a very primitive version of the scheduler that runs in the kernel, implemented in userland. It puts a process "to sleep" for a small amount of time if it has used more CPU time than what you deem right. A more sophisticated measurement maybe going over a second or 5 seconds would be possible (and probably desirable).
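Here is a hedged sketch of the measurement half of that scheme (the waitable timer plus GetProcessTimes part). The Throttle()/Unthrottle() stubs stand in for whichever mechanism you pick, such as the affinity-mask trick described above; hProcess needs at least PROCESS_QUERY_INFORMATION rights.

#include <windows.h>

// Measurement half of the scheme above: wake up every 100 ms, compute the
// target process's CPU usage over the interval from GetProcessTimes, and decide
// whether to throttle. Throttle()/Unthrottle() are stubs for whatever
// mechanism you choose.
void Throttle(HANDLE)   { /* take the process offline */ }
void Unthrottle(HANDLE) { /* let it run again */ }

void MonitorProcess(HANDLE hProcess, double maxUsage /* e.g. 0.15 for 15% */)
{
    HANDLE timer = CreateWaitableTimerW(nullptr, FALSE, nullptr);
    LARGE_INTEGER due; due.QuadPart = -1000000LL;                // first fire in 100 ms
    SetWaitableTimer(timer, &due, 100, nullptr, nullptr, FALSE); // then every 100 ms

    auto toU64 = [](const FILETIME& ft) {
        return (ULONGLONG(ft.dwHighDateTime) << 32) | ft.dwLowDateTime;
    };

    FILETIME createTime, exitTime, lastKernel, lastUser, lastWall;
    GetProcessTimes(hProcess, &createTime, &exitTime, &lastKernel, &lastUser);
    GetSystemTimeAsFileTime(&lastWall);

    bool throttled = false;
    for (;;)
    {
        WaitForSingleObject(timer, INFINITE);

        FILETIME kernel, user, wall;
        GetProcessTimes(hProcess, &createTime, &exitTime, &kernel, &user);
        GetSystemTimeAsFileTime(&wall);

        ULONGLONG cpuDelta  = (toU64(kernel) - toU64(lastKernel)) +
                              (toU64(user)   - toU64(lastUser));
        ULONGLONG wallDelta =  toU64(wall)   - toU64(lastWall);
        double usage = wallDelta ? double(cpuDelta) / double(wallDelta) : 0.0;

        if (throttled)             { Unthrottle(hProcess); throttled = false; }
        else if (usage > maxUsage) { Throttle(hProcess);   throttled = true;  }

        lastKernel = kernel; lastUser = user; lastWall = wall;
    }
}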
You might be tempted to suspend all threads in the process instead. However, it is important not to fiddle with priorities and not to use SuspendThread unless you know exactly what a program is doing, as this can easily lead to deadlocks and other nasty side effects. Think for example of suspending a thread holding a critical section while another thread is still running and trying to acquire the same object. Or imagine your process gets swapped out in the middle of suspending a dozen threads, leaving half of them running and the other half dead.
Setting the affinity mask to zero on the other hand simply means that from now on no single thread in the process gets any more time slices on any processor. Resetting the affinity gives -- atomically, at the same time -- all threads the possibility to run again.
Unfortunately, SetProcessAffinityMask does not return the old mask the way SetThreadAffinityMask does, at least according to the documentation. Therefore an extra Get... call is necessary.
CPU usage is fairly simple to estimate using QueryProcessCycleTime. The machine's processor speed can be obtained from HKLM\HARDWARE\DESCRIPTION\System\CentralProcessor\N\~MHz (where N is the processor number; there is one entry for each processor present). With these values, you can estimate your process's CPU usage and yield the CPU as necessary using Sleep() to keep your usage in bounds.
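A hedged sketch of that estimate: it reads the ~MHz value for processor 0 only and treats the clock speed as constant, so with turbo and power management the result is only a rough approximation.

#include <windows.h>

// Rough CPU-usage estimate for the current process from cycle counts.
// Reads ~MHz for processor 0 only and assumes a fixed clock speed.
double EstimateCpuUsage(ULONG64& lastCycles, ULONGLONG& lastTickMs)
{
    DWORD mhz = 0, size = sizeof(mhz);
    RegGetValueW(HKEY_LOCAL_MACHINE,
                 L"HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
                 L"~MHz", RRF_RT_REG_DWORD, nullptr, &mhz, &size);

    ULONG64 cycles = 0;
    QueryProcessCycleTime(GetCurrentProcess(), &cycles);
    ULONGLONG nowMs = GetTickCount64();

    if (mhz == 0 || nowMs == lastTickMs) return 0.0;

    double cpuMs  = double(cycles - lastCycles) / (double(mhz) * 1000.0);
    double wallMs = double(nowMs - lastTickMs);

    lastCycles = cycles;
    lastTickMs = nowMs;
    return cpuMs / wallMs;   // fraction of one core's time used since the last call
}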