I need to compare two pictures and find the pixels that differ by more than a specified threshold.
Right now I do it in a plain for loop, and it takes about 3 seconds for a small 600x400 picture.
I'm wondering if there is a way to do it faster using OpenGL, DirectX, CUDA or something similar, so that it uses the GPU and not just the CPU.
Note that the output I need is an array of the differing pixels, not just a boolean saying whether the pictures are the same.
So I looked at the Delphi source, and it looks like this:
function TCanvas.GetPixel(X, Y: Integer): TColor;
begin
  RequiredState([csHandleValid]);
  GetPixel := Windows.GetPixel(FHandle, X, Y);
end;
It seems to call the WinAPI function GetPixel() for every pixel. That's probably why it's so slow.
So now my question is: is there a way to get the whole array of pixels via WinAPI? I'm working with a screenshot that I have as an HBITMAP, so using it with WinAPI will not be a problem.
Since you are using Delphi, you can load the images into a TBitmap and then use the ScanLine property for fast access to the bitmap's pixels.
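To illustrate why whole-buffer access helps, here is a minimal sketch in Python rather than Delphi. It assumes both images are already in memory as tightly packed 24-bit RGB buffers of equal size, which is roughly what row access via ScanLine gives you. The name `diff_pixels` and the per-channel threshold rule are my own assumptions, not something from the question.

```python
def diff_pixels(buf_a, buf_b, width, height, threshold):
    """Return (x, y) for every pixel whose channels differ by more than threshold.

    Assumption: buf_a and buf_b are tightly packed 24-bit RGB buffers
    (3 bytes per pixel, no row padding). Real Windows bitmaps may pad rows.
    """
    if len(buf_a) != width * height * 3 or len(buf_b) != len(buf_a):
        raise ValueError("buffer size does not match width * height * 3")
    changed = []
    for y in range(height):
        row = y * width * 3                  # byte offset of this scanline
        for x in range(width):
            i = row + x * 3
            # A pixel counts as different if any channel exceeds the threshold.
            if (abs(buf_a[i] - buf_b[i]) > threshold
                    or abs(buf_a[i + 1] - buf_b[i + 1]) > threshold
                    or abs(buf_a[i + 2] - buf_b[i + 2]) > threshold):
                changed.append((x, y))
    return changed


# Two 3x2 black images; the second has one bright channel at pixel (1, 1).
a = bytearray(3 * 2 * 3)
b = bytearray(3 * 2 * 3)
b[(1 * 3 + 1) * 3] = 200
print(diff_pixels(a, b, 3, 2, 10))   # prints [(1, 1)]
```

The point is that the loop only indexes memory; there is no API call per pixel.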
While it is technically possible to do such image operations using OpenGL or Direct3D, this is not what they're meant for. They're drawing APIs.
CUDA or OpenCL would be better suited, but they're total overkill for something as simple as comparing images. Also, the upload overhead will have a negative impact on performance.
3 s for such a simple operation on a fairly small image means that you're doing something terribly wrong. I mean: my laptop can encode Full HD video to H.264 in real time, and that is one of the most complex tasks you can do on images.
Hell yeah! You can do it on GPUs using CUDA/OpenCL; in fact, your case exemplifies the parallelism you can achieve on GPUs. For example, in CUDA you would launch 600x400 threads on the GPU, which simultaneously calculate the pixel difference of the two images at each point.
In other words, the two nested for loops of 600 and 400 iterations are replaced by 240,000 threads on the GPU. Thread 0 calculates the pixel difference at point 0, thread 1 at point 1, and so on. All the threads will theoretically execute in parallel on the GPU.
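The thread-to-pixel mapping can be sketched on the CPU. This is a plain Python emulation, not CUDA code: `kernel` stands in for the body each GPU thread would run, and the images are assumed to be flat lists with one grayscale value per pixel. All names are illustrative.

```python
WIDTH, HEIGHT = 600, 400


def thread_to_pixel(thread_id, width=WIDTH):
    """Map a flat thread index to the (x, y) pixel it is responsible for."""
    return thread_id % width, thread_id // width


def kernel(thread_id, img_a, img_b, out, threshold):
    """Work done by one simulated thread: compare exactly one pixel."""
    out[thread_id] = abs(img_a[thread_id] - img_b[thread_id]) > threshold


img_a = [0] * (WIDTH * HEIGHT)
img_b = [0] * (WIDTH * HEIGHT)
img_b[601] = 50                       # pixel (1, 1) differs
out = [False] * (WIDTH * HEIGHT)

# On a GPU these 240,000 invocations would run concurrently, not in a loop.
for tid in range(WIDTH * HEIGHT):
    kernel(tid, img_a, img_b, out, 10)

differing = [thread_to_pixel(t) for t, flag in enumerate(out) if flag]
print(differing)                      # prints [(1, 1)]
```

No invocation reads anything another invocation writes, which is what makes the loop safe to run in parallel.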
Shortcoming:
Although the computation on the GPU will be much faster than on the CPU, you also need to first upload the image data to GPU memory and copy the results back to CPU memory after the computation. If the overall GPU time (including computation and memory transfers) is less than the CPU computation time, then you have a win.
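That break-even condition is simple arithmetic, sketched below. Every number here is a made-up placeholder (kernel time, bus bandwidth, CPU time); measure your own hardware before drawing any conclusion.

```python
def total_gpu_seconds(compute_s, bytes_up, bytes_down, bus_bytes_per_s):
    """GPU wall time = upload + kernel + download."""
    return bytes_up / bus_bytes_per_s + compute_s + bytes_down / bus_bytes_per_s


# Placeholder inputs: two 600x400 RGB images up, one flag byte per pixel back.
pixels = 600 * 400
gpu_s = total_gpu_seconds(
    compute_s=0.0001,          # assumed kernel time
    bytes_up=2 * pixels * 3,
    bytes_down=pixels,
    bus_bytes_per_s=4e9,       # assumed effective bus bandwidth
)
cpu_s = 0.002                  # assumed time of an optimized CPU loop

print("GPU wins" if gpu_s < cpu_s else "CPU wins")
```

With different assumptions the comparison can easily flip, which is exactly the shortcoming described above.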
HLSL/GLSL.
With them you can run a lot of simultaneous mini-threads. Each one is slow on its own, but together they are well suited to pixel comparison.
Related
I am trying to design a convolution kernel for CUDA. It will take relatively small pictures (typically a 19 x 19 image for my application).
In my research, I found most notably this paper: https://www.evl.uic.edu/sjames/cs525/final.html
I understand the concept, but I wonder: for small images, is it fast enough to use one block per pixel of the original image, use the threads of that block to fetch the pixels, and then do a block-wide reduction? I made a basic implementation that makes global memory accesses coalesced, so is it a good design for small pictures? Or should I follow the "traditional" method?
It all depends on what you eventually want your program to do. If you intend to convolve only a few "relatively small pictures", as you mention, then a naive approach should be sufficient. In fact, a serial approach may even be faster because of the memory transfer overhead between the CPU and GPU if you're not processing much data. I would recommend first writing the kernel that accesses global memory, as you mention, and if you will be working with a larger dataset in the future, it would make sense to attempt the "traditional" approach as well and compare runtimes.
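For reference, a serial baseline to compare runtimes against could look like the sketch below. It is plain Python with zero padding at the borders; the function name and the padding choice are my assumptions, not something from the question or the linked paper.

```python
def convolve2d(image, kernel):
    """Naive same-size 2D convolution with zero padding at the borders."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2            # kernel centre offsets
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    # Convolution flips the kernel, hence the minus signs.
                    sy, sx = y + oy - j, x + ox - i
                    if 0 <= sy < h and 0 <= sx < w:
                        acc += image[sy][sx] * kernel[j][i]
            out[y][x] = acc
    return out


# A 19x19 test image and a 3x3 box kernel.
image = [[1.0] * 19 for _ in range(19)]
box = [[1.0] * 3 for _ in range(3)]
result = convolve2d(image, box)
print(result[9][9], result[0][0])        # prints 9.0 4.0
```

Each output pixel depends only on the input, so the two outer loops are the part a GPU kernel would parallelize.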
I'm looking into the feasibility of GPU synthesized audio, where each thread renders a sample. This puts some interesting restrictions on what algorithms can be used - any algorithm that refers to a previous set of samples cannot be implemented in this fashion.
Filtering is one of those algorithms. Bandpass, lowpass, or highpass - all of them require looking to the last few samples generated in order to compute the result. This can't be done because those samples haven't been generated yet.
This makes synthesizing bandlimited waveforms difficult. One approach is additive synthesis of partials using the Fourier series. However, this runs in O(n) time per sample, where n is the number of partials, and is especially slow on a GPU, to the point that the gain from parallelism is lost. If there were an algorithm that ran in O(1) time, this would eliminate branching AND be up to 1000x faster when dealing with the audible range.
I'm specifically looking for something like a DSF for a sawtooth. I've been trying to work out a simplification of the Fourier series by hand, but that's really, really hard, mainly because it involves harmonic numbers, AKA the only singularity of the Riemann zeta function.
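For concreteness, the additive approach described above can be sketched like this in plain Python. It uses the standard Fourier series of a rising sawtooth; the function name and the choice to include every partial up to Nyquist are assumptions for illustration.

```python
import math


def bandlimited_saw(freq, sample_rate, n):
    """Sample n of a sawtooth summed from all partials up to Nyquist."""
    partials = int((sample_rate / 2) // freq)    # loop length grows as freq drops
    phase = 2 * math.pi * freq * n / sample_rate
    total = 0.0
    for k in range(1, partials + 1):
        # Fourier series of a sawtooth: alternating-sign sines with 1/k amplitude.
        total += (-1) ** (k + 1) * math.sin(k * phase) / k
    return 2 / math.pi * total


# 441 Hz at 44.1 kHz gives 50 partials and a period of 100 samples.
samples = [bandlimited_saw(441, 44100, n) for n in range(100)]
print(round(samples[25], 2))    # a quarter period in, close to 0.5
```

The inner loop is the O(n) cost in question: every sample pays for every partial, independently of all other samples.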
Is a constant-time algorithm achievable? If not, can it be proven that it isn't?
Filtering is one of those algorithms. Bandpass, lowpass, or highpass - all of them require looking to the last few samples generated in order to compute the result. This can't be done because those samples haven't been generated yet.
That's not right. IIR filters do need previous results, but FIR filters only need previous input; that is pretty typical for the things that GPUs were designed to do, so it's not likely a problem to let every processing core access let's say 64 input samples to produce one output sample -- in fact, the cache architectures that Nvidia and AMD use lend themselves to that.
Is a constant-time algorithm achievable? If not, can it be proven that it isn't?
It is! In two aspects:
as mentioned above, FIR filters only need multiple samples of immutable input, so they can be parallelized heavily without problems, and
even if you need to calculate your input first, and would like to parallelize that (I don't see a reason for that -- generating a sawtooth is not CPU-limited, but memory bandwidth limited), every core could simply calculate the last N samples -- sure, there's N-1 redundant operations, but as long as your number of cores is much bigger than your N, you will still be faster, and every core will have constant run time.
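The first point can be made concrete with a small sketch: each FIR output sample is computed from input samples only, so the outputs can be evaluated in any order or all at once. This is plain Python, and `fir_output` is an illustrative name.

```python
def fir_output(x, taps, n):
    """Compute output sample n from input samples only (no previous outputs)."""
    acc = 0.0
    for k, h in enumerate(taps):
        if n - k >= 0:               # samples before the start count as zero
            acc += h * x[n - k]
    return acc


x = [1.0, 2.0, 3.0, 4.0, 5.0]
taps = [0.5, 0.5]                    # two-tap moving average
y = [fir_output(x, taps, n) for n in range(len(x))]
print(y)                             # prints [0.5, 1.5, 2.5, 3.5, 4.5]
```

Because `fir_output` never reads `y`, one GPU thread per output sample (or per block of samples) needs no synchronization with the others.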
Comments on your approach:
I'm looking into the feasibility of GPU synthesized audio, where each thread renders a sample.
From a higher-level perspective, that sounds too fine-grained. I mean, let's say you have 3000 stream processors (high-end consumer GPU). Assuming a sampling rate of 44.1 kHz, and assuming each of these processors does only one sample, letting them all run once only gives you 3000/44100, or about 1/14.7, of a second of audio (mono). Then you'd have to move on to the next part of the audio.
In other words: there are bound to be many more samples than processors. In these situations, it's typically far more efficient to let one processor handle a sequence of samples; for example, if you want to generate 30 s of audio, that's 1.323 MS (megasamples). Simply split the problem into 3000 chunks, one for each processor, and give each of them the 44100*30/3000 = 441 samples it should process plus 64 samples of "history" before the first of its "own" samples; that will still easily fit into local memory.
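The chunking arithmetic can be sketched as follows. It is plain Python; `chunk_ranges` is a hypothetical helper that only computes index ranges and does no audio processing itself.

```python
def chunk_ranges(total, workers, history):
    """Yield (read_start, own_start, own_end) for each worker's slice.

    Each worker produces samples own_start..own_end-1 but may read from
    read_start, i.e. up to `history` samples before its own first sample.
    """
    per_worker = total // workers
    for w in range(workers):
        own_start = w * per_worker
        # The last worker also takes any remainder.
        own_end = total if w == workers - 1 else own_start + per_worker
        read_start = max(0, own_start - history)
        yield read_start, own_start, own_end


ranges = list(chunk_ranges(44100 * 30, 3000, 64))
print(ranges[0], ranges[1])    # prints (0, 0, 441) (377, 441, 882)
```

Each worker therefore touches at most 441 + 64 = 505 input samples.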
Yet another thought:
I'm coming from a software-defined radio background, where there are usually millions of samples per second, rather than a few kHz of sampling rate, in real time (i.e. processing speed > sampling rate). Still, doing computation on the GPU only pays off for the more CPU-intensive tasks, because there's significant overhead in exchanging data with the GPU, and CPUs nowadays are blazingly fast. So, for your relatively simple problem, doing things on the GPU might never be faster than optimizing them on the CPU; things of course look different if you've got to process lots of samples, or a lot of streams, at once. For finer-grained tasks, the problem of filling a buffer, moving it to the GPU, and getting the result buffer back into your software usually kills the advantage.
Hence, I'd like to challenge you: Download the GNU Radio live DVD, burn it to a DVD or write it to a USB stick (you might as well run it in a VM, but that of course reduces performance if you don't know how to optimize your virtualizer; really - try it from a live medium), run
volk_profile
to let the VOLK library test which algorithms work best on your specific machine, and then launch
gnuradio-companion
And then open the following two signal processing flow graphs:
"classical FIR": This single-threaded implementation of the FIR filter yields about 50MSamples/s on my CPU.
FIR Filter implemented with the FFT, running on 4 threads: This implementation reaches 160MSamples/s (!!) on my CPU alone.
Sure, with the help of FFTs on my GPU, I could be faster, but the thing here is: Even with the "simple" FIR filter, I can, with a single CPU core, get 50 Megasamples out of my machine -- meaning that, with a 44.1kHz audio sampling rate, per single second I can process roughly 19 minutes of audio. No copying in and out of host RAM. No GPU cooler spinning up. It might really not be worth optimizing further. And if you optimize and take the FFT-Filter approach: 160MS/s means roughly one hour of audio per processing second, including sawtooth generation.
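The throughput arithmetic behind those two figures, as a quick check in Python (the helper name is made up; the 50 MS/s and 160 MS/s inputs are the measurements quoted above):

```python
SAMPLE_RATE = 44_100    # audio sampling rate in Hz


def audio_seconds_per_second(throughput, sample_rate=SAMPLE_RATE):
    """Seconds of audio processed per second of wall-clock time."""
    return throughput / sample_rate


fir_minutes = audio_seconds_per_second(50e6) / 60     # "classical" FIR
fft_minutes = audio_seconds_per_second(160e6) / 60    # FFT-based FIR
print(round(fir_minutes, 1), round(fft_minutes, 1))   # roughly 18.9 and 60.5
```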
I am working on some signal processing code in SciPy, and am now trying to use a numerical optimizer to tune it. Unfortunately, as these things go, it is turning out to be quite a slow process.
The operations I must perform for this optimization are the following:
Load a large 1-d data file (~ 120000 points)
Run optimizer, which:
Executes a signal processing operation, does not modify original data, produces 120000 new data points.
Examines difference between original signal and new signal using various operations,
One of which includes FFT-based convolution
Generates a single "error" value to summarise the result -- this is what should be minimized
Looks at error and re-runs operation with different parameters
The signal processing and error functions take under 3 seconds, but unfortunately doing it 50,000 times takes much longer. I am experimenting with various more efficient optimisation algorithms, but no matter what it's going to take thousands of iterations.
I have parallelised a couple of the optimisers I'm trying using CPU threads, which wasn't too difficult since the optimiser can easily perform several scheduled runs at once on separate threads using ThreadPool.map.
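For readers unfamiliar with that pattern, here is a minimal sketch of evaluating several parameter sets at once with ThreadPool.map. The `error_for` function is a made-up stand-in for the real "process the signal, then score it" step, and the parameter tuples are purely illustrative.

```python
from multiprocessing.pool import ThreadPool


def error_for(params):
    """Hypothetical stand-in: process a signal with params, return one error value."""
    gain, offset = params
    signal = [0.0, 1.0, 2.0, 3.0]
    processed = [gain * s + offset for s in signal]
    return sum((p - s) ** 2 for p, s in zip(processed, signal))


candidates = [(1.0, 0.0), (0.5, 0.0), (1.0, 1.0)]

# One scheduled run per candidate; map returns results in input order.
with ThreadPool(processes=3) as pool:
    errors = pool.map(error_for, candidates)

best = candidates[errors.index(min(errors))]
print(errors, best)    # prints [0.0, 3.5, 4.0] (1.0, 0.0)
```

Threads only give a real speed-up when the heavy work releases the GIL; pure-Python loops like the one in this toy example will not run faster this way.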
But this is only about a 2x speed-up on my laptop, or maybe 8x on a multicore computer. My question is, is this an application for which I could make use of GPU processing? I have already translated some parts of the code to C, and I could imagine using OpenCL to create a function from an array of parameters to an array of error values, and running this hundreds of times at once. -- Even if it performs the sequential processing part slowly, getting all the results in one shot would be amazing.
However, my guess is that the memory requirements (loading up a large file and producing a temporary one of equal size to generate every data point) would make it difficult to run the whole algorithm in an OpenCL kernel. I don't have much experience with GPU processing and writing CUDA/OpenCL code, so I don't want to set about learning the ins and outs if there is no hope in making it work.
Any advice?
Do you need to produce all 120,000 new points before analysing the difference? Could you calculate the new point, then decide for that point if you are converging?
How big are the points? A $50 graphics card today has 1 GB of memory, which should be plenty for 120K points. I'm not as familiar with OpenCL as with CUDA, but there may also be limits on how much of this is texture memory vs. general memory, etc.
edit: More familiar with CUDA than OpenCL but this probably applies to both.
The memory on GPUs is a bit more complex but very flexible: you have texture memory that can be read by the GPU kernel and has some very clever cache features to make access to values in 2D and 3D arrays very fast. There is OpenGL memory that you can write to for display, and there is a limited (16-64 KB?) fast on-chip memory, which in CUDA is shared per thread block rather than private to each thread.
Although transfers from main memory to the GPU are relatively slow (a few GB/s), the internal memory bus on the graphics card is about 20x as fast as this.
I'm designing a CUDA app to process some video. The algorithm I'm using calls for filling in blank pixels in a way that's not unlike Conway's Game of Life: if the pixels around a given pixel are all filled and all of similar values, that pixel gets filled in with the surrounding value. This iterates until the number of pixels left to fix is equal to the number left to fix in the previous iteration (i.e., when nothing else can be done).
My quandary is this: the previous and next part of the processing pipeline are both implemented in CUDA on the GPU. It would be expensive to transfer the entire image back to RAM, process it on the CPU, then transfer it back to the GPU. Even if it's slower, I would like to implement the algorithm in CUDA.
However, the nature of this problem requires synchronization between all threads to update the global image between each iteration. I thought about just calling the Kernel for each iteration multiple times, but I cannot determine when the process is "done" unless I transfer data back to the CPU between each iteration, which would introduce a large inefficiency because of the memory transfer latency through the PCI-e interface.
Does anyone with some experience with parallel algorithms have any suggestions? Thanks in advance.
It sounds like you need an extra image buffer, so that you can keep the unmodified input image in one buffer and write the processed output image into the second buffer. That way each thread can process a single output pixel (or small block of output pixels) without worrying about synchronization etc.
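A minimal sketch of that two-buffer scheme, in plain Python and reduced to one dimension for brevity. The fill rule (both neighbours filled and equal) is a simplified stand-in for the "similar values" test in the question, and all names are illustrative.

```python
EMPTY = None


def step(src, dst):
    """One pass: read only from src, write only to dst.

    Fills each empty cell whose two neighbours are filled and equal.
    Returns the number of cells that are still empty afterwards.
    """
    remaining = 0
    for i, value in enumerate(src):
        if (value is EMPTY and 0 < i < len(src) - 1
                and src[i - 1] is not EMPTY and src[i - 1] == src[i + 1]):
            value = src[i - 1]
        if value is EMPTY:
            remaining += 1
        dst[i] = value
    return remaining


def fill(image):
    src, dst = list(image), [EMPTY] * len(image)
    previous = None
    while True:
        remaining = step(src, dst)
        src, dst = dst, src           # swap buffers instead of copying
        if remaining == previous:     # same count as last pass: done
            return src
        previous = remaining


print(fill([5, None, 5, None, 5]))    # prints [5, 5, 5, 5, 5]
```

In CUDA, `step` would be one kernel launch per iteration. If `remaining` were kept as a single integer in device memory, only a few bytes would need to come back to the host per pass; that last part is my suggestion, not something the answer above states.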
I'm writing my own graphics library (yep, it's homework :) and use CUDA to do all the rendering and calculations fast.
I have a problem with drawing filled triangles. I wrote it in such a way that one process draws one triangle. It works pretty well when there are a lot of small triangles in the scene, but performance breaks down completely when the triangles are big.
My idea is to do two passes. In the first, calculate only a table with information about scanlines (draw from here to there). This would be a triangle-per-process calculation as in the current algorithm. In the second pass, actually draw the scanlines with more than one process per triangle.
But will it be fast enough? Maybe there is some better solution?
You can check this blog: A Software Rendering Pipeline in CUDA. I don't think that's the optimal way to do it, but at least the author shares some useful sources.
Second, read this paper: A Programmable, Parallel Rendering Architecture. I think it's one of the most recent papers, and it's also CUDA-based.
If I had to do this, I would go with a Data-Parallel Rasterization Pipeline like in Larrabee (which is TBR) or even REYES and adapt it to CUDA:
http://www.ddj.com/architect/217200602
http://home.comcast.net/~tom_forsyth/larrabee/Standford%20Forsyth%20Larrabee%202010.zip (see the second part of the presentation)
http://graphics.stanford.edu/papers/mprast/
I suspect that you have some misconceptions about CUDA and how to use it, especially since you refer to a "process" when, in CUDA terminology, there is no such thing.
For most CUDA applications, there are two important things to getting good performance: optimizing memory access and making sure each 'active' CUDA thread in a warp performs the same operation at the same time as the other active threads in the warp. Both of these sound like they are important for your application.
To optimize your memory access, you want to make sure that your reads from global memory and your writes to global memory are coalesced. You can read more about this in the CUDA programming guide, but it essentially means that adjacent threads in a half warp must read from or write to adjacent memory locations. Also, each thread should read or write 4, 8 or 16 bytes at a time.
If your memory access pattern is random, then you might need to consider using texture memory. When you need to refer to memory that has been read by other threads in a block, then you should make use of shared memory.
In your case, I'm not sure what your input data is, but you should at least make sure that your writes are coalesced. You will probably have to invest some non-trivial amount of effort to get your reads to work efficiently.
For the second part, I would recommend that each CUDA thread process one pixel in your output image. With this strategy, you should watch out for loops in your kernels that will execute longer or shorter depending on the per-thread data. Each thread in your warps should perform the same number of steps in the same order. The only exception to this is that there is no real performance penalty for having some threads in a warp perform no operation while the remaining threads perform the same operation together.
Thus, I would recommend having each thread check if its pixel is inside a given triangle. If not, it should do nothing. If it is, it should compute the output color for that pixel.
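A per-pixel inside test of that kind can be sketched with edge functions. This is plain Python emulating what each thread would do for its own pixel; the function names and the character "colors" are illustrative only.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of the directed edge A->B the point P is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def shade_pixel(px, py, tri, color, background):
    """What one thread does: test its own pixel, then color it or leave it."""
    (ax, ay), (bx, by), (cx, cy) = tri
    w0 = edge(ax, ay, bx, by, px, py)
    w1 = edge(bx, by, cx, cy, px, py)
    w2 = edge(cx, cy, ax, ay, px, py)
    # Inside if P is on the same side of all three edges (either winding order).
    inside = ((w0 >= 0 and w1 >= 0 and w2 >= 0)
              or (w0 <= 0 and w1 <= 0 and w2 <= 0))
    return color if inside else background


tri = ((0, 0), (4, 0), (0, 4))
image = [[shade_pixel(x, y, tri, "#", ".") for x in range(5)] for y in range(5)]
print("\n".join("".join(row) for row in image))
```

Every pixel runs the same three multiply-subtract tests regardless of triangle size, so threads in a warp stay in step; only the final color write differs.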
Also, I'd strongly recommend reading more about CUDA as it seems like you are jumping into the deep end without having a good understanding of some of the basic fundamentals.
Not to be rude, but isn't this what graphics cards are designed to do anyway? Seems like using the standard OpenGL and Direct3D APIs would make more sense.
Why not use the APIs to do your basic rendering, rather than CUDA, which is much lower-level? Then, if you wish to do additional operations that are not supported, you can use CUDA to apply them on top. Or maybe implement them as shaders.