Massively Parallel algorithm to propagate pixels - algorithm

I'm designing a CUDA app to process some video. The algorithm I'm using calls for filling in blank pixels in a way that's not unlike Conway's game of life: if the pixels around another pixels are all filled and all of similar values, the specific pixel gets filled in with the surrounding value. This iterates until all the number of pixels to fix is equal to the number of pixels to fix in the last iteration (ie, when nothing else can be done).
My quandary is this: the previous and next part of the processing pipeline are both implemented in CUDA on the GPU. It would be expensive to transfer the entire image back to RAM, process it on the CPU, then transfer it back to the GPU. Even if it's slower, I would like to implement the algorithm in CUDA.
However, the nature of this problem requires synchronization between all threads to update the global image between each iteration. I thought about just calling the Kernel for each iteration multiple times, but I cannot determine when the process is "done" unless I transfer data back to the CPU between each iteration, which would introduce a large inefficiency because of the memory transfer latency through the PCI-e interface.
Does anyone with some experience with parallel algorithms have any suggestions? Thanks in advance.

It sounds like you need an extra image buffer, so that you can keep the unmodified input image in one buffer and write the processed output image into the second buffer. That way each thread can process a single output pixel (or small block of output pixels) without worrying about synchronization etc.

Related

DirectX11 - Buffer size of instanced vertices, with various size

With DirectX/C++, suppose you are drawing the same model many times to the screen. You can do this with DrawIndexedInstanced(). You need to set the size of the instance buffer when you create it:
D3D11_BUFFER_DESC_instance.ByteWidth = sizeof(struct_with_instance_data)* instance_count;
If the instance_count can vary between a low and high value, is it customary to create the buffer with the max value (max_instance_count)? And only draw what is required.
Wouldn't that permanently use a lot of memory?
Recreating the buffer is a slow solution?
What are good methods?
Thank you.
All methods have pros and cons.
Create max.count — as you pointed out you’ll consume extra memory.
Create some initial count, and implement exponential growth — you can OOM in runtime, also growing large buffers may cause spikes in the profiler.
There’s other way which may or may not work for you depending on the application. You can create reasonably large fixed-size buffer, to render more instances call DrawIndexedInstanced multiple times in a loop, replacing the data in the buffer between calls. Will work well if the source data is generated in runtime from something else, you’ll need to rework that part to produce fixed-size (except the last one) batches instead of the complete buffer. Won’t work if the data in the buffer needs to persist across frames, e.g. if you update it with a compute shader.

Rendering in DirectX 11

When frame starts, I do my logical update and render after that.
In my render code I do usual stuff. I set few states, buffors, textures, and end by calling Draw.
m_deviceContext->Draw(
nbVertices,
0);
At frame end I call present to show rendered frame.
// Present the back buffer to the screen since rendering is complete.
if(m_vsync_enabled)
{
// Lock to screen refresh rate.
m_swapChain->Present(1, 0);
}
else
{
// Present as fast as possible.
m_swapChain->Present(0, 0);
}
Usual stuff. Now, when I call Draw, according to MSDN
Draw submits work to the rendering pipeline.
Does it mean that data is send to GPU and main thread (the one called Draw) continues? Or does it wait for rendering to finish?
In my opinion, only Present function should make main thread wait for rendering to finish.
There are a number of calls which can trigger the GPU to start working, Draw being one. Other's include Dispatch, CopyResource, etc. What the MSDN docs are trying to say is that stuff like PSSetShader. IASetPrimitiveTopology, etc. doesn't really do anything until you call Draw.
When you call Present that is taken as an implicit indicator of 'end of frame' but your program can often continue on with setting up rendering calls for the next frame well before the first frame is done and showing. By default, Windows will let you queue up to 3 frames ahead before blocking your CPU thread on the Present call to let the GPU catch-up--in real-time rendering you usually don't want the latency between input and render to be really high.
The fact is, however, that GPU/CPU synchronization is complicated and the Direct3D runtime is also batcning up requests to minimize kernel-call overhead so the actual work could be happing after many Draws are submitted to the command-queue. This old article gives you the flavor of how this works. On modern GPUs, you can also have various memory operations for paging in memory, setting up physical video memory areas, etc.
BTW, all this 'magic' doesn't exist with Direct3D 12 but that means the application has to do everything at the 'right' time to ensure it is both efficient and functional. The programmer is much more directly building up command-queues, triggering work on various pixel and compute GPU engines, and doing all the messy stuff that is handled a little more abstracted and automatically by Direct3 11's runtime. Even still, ultimately the video driver is the one actually talking to the hardware so they can do other kinds of optimizations as well.
The general rules of thumb here to keep in mind:
Creating resources is expensive, especially runtime shader compilation (by HLSL complier) and runtime shader blob optimization (by driver)
Copying resources to the GPU (i.e. loading texture data from the CPU memory) requires bus bandwidth that is limited in supply: Prefer to keep textures, VB, and IB data in Static buffers you reuse.
Copying resources from the GPU (i.e. moving GPU memory to CPU memory) uses a backchannel that is slower than going to the GPU: try to avoid the need for readback from the GPU
Submitting larger chunks of geometry per Draw call helps to amortize overhead (i.e. calling draw once for 10,000 triangles with the same state/shader is much faster than calling draw 10 times for a 1000 triangles each with changing state/shaders between).

Fast comparing two pictures using OpenGL or DirectX

I need to compare 2 pictures and find pixels that are different with specified threshold.
Now I'm doing it just programmatically in for loop, it take about 3 seconds for small 600x400 picture.
I'm wondering if there a way to do it faster using OpenGL, DirectX, CUDA or something like this? So it will use GPU and not just CPU.
Notice that in output I need an array of different pixels, not just boolean value depending on if it same picture or not.
So I looked at source in delphi and it look like this:
function TCanvas.GetPixel(X, Y: Integer): TColor;
begin
RequiredState([csHandleValid]);
GetPixel := Windows.GetPixel(FHandle, X, Y);
end;
Seems like it calls WinAPI function GetPixel() each time. Probably that's why it's so slow.
So now my question is: is there a way to get whole array of pixels via WinAPI? I'm working with a screenshot that have HBITMAP, so it will not be a problem to use it with WinAPI.
Since you are using delphi , you can load the images in a TBitmap and then use the ScanLine property to fast access the pixels of a bitmap.
While it is technically possible to do such image operations using OpenGL or Direct3D this is not what they're meant for. They're drawing APIs.
CUDA or OpenCL would be better suited, but they're total overkill for something as simple as comparing images. Also the upload overhead will have negative impact on performance.
3s for such a simple image operation on a fairly small image means, that you're doing something terribly wrong. I mean: My laptop can do encoding of FullHD video to h264 in realtime, and this is about one of the most complex tasks you can do on images.
Hell yea! you can do it on GPUs using CUDA/OpenCL, rather your case exemplifies the parallelism you can achieve on GPUs. For example in CUDA you'll launch 600x400 threads on GPU that will simultaneously calculate the pixel difference of two images at each point.
In other words the two nested for loops of 600 and 400 iteration count will be removed by 240,000 threads on GPU. Thread 0 will calculate the pixel difference at point 0, thread 1 at point 1 and so on. All the threads will theoretically execute in parallel on GPU.
Shortcoming:
Although the computation on GPU will be much faster than that on CPU but you also need to first upload the image data onto GPU memory and the results after computation back to CPU memory. If the overall GPU time (including computation and memory transfers) is less than that of CPU computation time then you have a win.
HLSL/GLSL.
With them you can perform a lot of simultaneous minithreads, which performance by one is low, but it be good for pixel comparsion.

why are draw calls expensive?

assuming the texture, vertex, and shader data are already on the graphics card, you don't need to send much data to the card. there's a few bytes to identify the data, and presumably a 4x4 matrix, and some assorted other parameters.
so where is all of the overhead coming from? do the operations require a handshake of some sort with the gpu?
why is sending a single mesh containing a bunch of small models, calculated on the CPU, often faster than sending the vertex id and transformation matrices? (the second option looks like there should be less data sent, unless the models are smaller than a 4x4 matrix)
First of all, I'm assuming that with "draw calls", you mean the command that tells the GPU to render a certain set of vertices as triangles with a certain state (shaders, blend state and so on).
Draw calls aren't necessarily expensive. In older versions of Direct3D, many calls required a context switch, which was expensive, but this isn't true in newer versions.
The main reason to make fewer draw calls is that graphics hardware can transform and render triangles much faster than you can submit them. If you submit few triangles with each call, you will be completely bound by the CPU and the GPU will be mostly idle. The CPU won't be able to feed the GPU fast enough.
Making a single draw call with two triangles is cheap, but if you submit too little data with each call, you won't have enough CPU time to submit as much geometry to the GPU as you could have.
There are some real costs with making draw calls, it requires setting up a bunch of state (which set of vertices to use, what shader to use and so on), and state changes have a cost both on the hardware side (updating a bunch of registers) and on the driver side (validating and translating your calls that set state).
But the main cost of draw calls only apply if each call submits too little data, since this will cause you to be CPU-bound, and stop you from utilizing the hardware fully.
Just like Josh said, draw calls can also cause the command buffer to be flushed, but in my experience that usually happens when you call SwapBuffers, not when submitting geometry. Video drivers generally try to buffer as much as they can get away with (several frames sometimes!) to squeeze out as much parallelism from the GPU as possible.
You should read the nVidia presentation Batch Batch Batch!, it's fairly old but covers exactly this topic.
Graphics APIs like Direct3D translate their API-level calls into device-agnostic commands and queue them up in a buffer. Flushing that buffer, to perform actual work, is expensive -- both because it implies the actual work is now being performed, and because it can incur a switch from user to kernel mode on the chip (and back again), which is not that cheap.
Until the buffer is flushed, the GPU is able to do some prep work in parallel with the CPU, so long as the CPU doesn't make a blocking request (such as mapping data back to the CPU). But the GPU won't -- and can't -- prepare everything until it needs to actually draw. Just because some vertex or texture data is on the card doesn't mean it's arranged appropriately yet, and may not be arrangeable until vertex layouts are set or shaders are bound, et cetera. The bulk of the real work happens during the command flush and draw call.
The DirectX SDK has a section on accurately profiling D3D performance which, while not directly related to your question, can supply some hints as to what is and is not expensive and (in some cases) why.
More relevant is this blog post (and the follow-up posts here and here), which provide a good overview of the logical, low-level operational process of the GPU.
But, essentially (to try and directly answer your questions), the reason the calls are expensive isn't that there is necessarily a lot of data to transfer, but rather that there is a large body of work beyond just shipping data across the bus that gets deferred until the command buffer is flushed.
Short answer: The driver buffers some or all of the actual the work until you call draw. This will show up as a relatively predictable amount of time spent in the draw call, depending how much state has changed.
This is done for a few reasons:
to avoid doing unnecessary work: If you (unnecessarily) set the same state multiple times before drawing it can avoid doing expensive work each time this occurs. This actually becomes a fairly common occurrence in a large codebase, say a production game engine.
to be able to reconcile what internally are interdependent states instead of processing them immediately with incomplete information
Alternate answer(s):
The buffer the driver uses to store rendering commands is full and the app is effectively waiting for the GPU to process some of the earlier work. This will typically show up as extremely large chunks of time blocking in a random draw call within a frame.
The number of frames that the driver is allowed to buffer up has been reached and the app is waiting on the GPU to process one of them. This will typically show up as a large chunk of time blocking in the first draw call within a frame, or on Present at the end of the previous frame.

Drawing triangles with CUDA

I'm writing my own graphics library (yep, its homework:) and use cuda to do all rendering and calculations fast.
I have problem with drawing filled triangles. I wrote it such a way that one process draw one triangle. It works pretty fine when there are a lot of small triangles on the scene, but it breaks performance totally when triangles are big.
My idea is to do two passes. In first calculate only tab with information about scanlines (draw from here to there). This would be triangle per process calculation like in current algorithm. And in second pass really draw the scanlines with more than one process per triangle.
But will it be fast enough? Maybe there is some better solution?
You can check this blog: A Software Rendering Pipeline in CUDA. I don't think that's the optimal way to do it, but at least the author shares some useful sources.
Second, read this paper: A Programmable, Parallel Rendering Architecture. I think it's one of the most recent paper and it's also CUDA based.
If I had to do this, I would go with a Data-Parallel Rasterization Pipeline like in Larrabee (which is TBR) or even REYES and adapt it to CUDA:
http://www.ddj.com/architect/217200602
http://home.comcast.net/~tom_forsyth/larrabee/Standford%20Forsyth%20Larrabee%202010.zip (see the second part of the presentation)
http://graphics.stanford.edu/papers/mprast/
I suspect that you have some misconceptions about CUDA and how to use it, especially since you refer to a "process" when, in CUDA terminology, there is no such thing.
For most CUDA applications, there are two important things to getting good performance: optimizing memory access and making sure each 'active' CUDA thread in a warp performs the same operation at the same time as otehr active threads in the warp. Both of these sound like they are important for your application.
To optimize your memory access, you want to make sure that your reads from global memory and your writes to global memory are coalesced. You can read more about this in the CUDA programming guide, but it essentially means, adjacent threads in a half warp must read from or write to adjacent memory locations. Also, each thread should read or write 4, 8 or 16 bytes at a time.
If your memory access pattern is random, then you might need to consider using texture memory. When you need to refer to memory that has been read by other threads in a block, then you should make use of shared memory.
In your case, I'm not sure what your input data is, but you should at least make sure that your writes are coalesced. You will probably have to invest some non-trivial amount of effort to get your reads to work efficiently.
For the second part, I would recommend that each CUDA thread process one pixel in your output image. With this strategy, you should watch out for loops in your kernels that will execute longer or shorter depending on the per-thread data. Each thread in your warps should perform the same number of steps in the same order. The only exception to this is that there is no real performance penalty for having some threads in a warp perform no operation while the remaining threads perform the same operation together.
Thus, I would recommend having each thread check if its pixel is inside a given triangle. If not, it should do nothing. If it is, it should compute the output color for that pixel.
Also, I'd strongly recommend reading more about CUDA as it seems like you are jumping into the deep end without having a good understanding of some of the basic fundamentals.
Not to be rude, but isn't this what graphics cards are designed to do anyway? Seems like using the standard OpenGL and Direct3D APIs would make more sense.
Why not use the APIs to do your basic rendering, rather than CUDA, which is much lower-level? Then, if you wish to do additional operations that are not supported, you can use CUDA to apply them on top. Or maybe implement them as shaders.

Resources