Implement large median filter with OpenCL

In an algorithm we are using a median filter with a large 257x257 window on a UInt16 image. My task is to implement that algorithm with OpenCL on a GPU.
In fact I do not only need the median but, in some cases, also the 0.001, 0.02 and 0.999 quantiles.
The obvious approach is to have an OpenCL kernel run for every output pixel, where the kernel loads all pixels of the window from the input image into local memory, then sorts these values and finally computes the quantiles (sketched below in plain C for reference).
The problem is that with a window size of 257x257 this approach needs at least 257 * 257 * 2 = 132098 bytes of local memory, but local memory is very limited: my Quadro K4000 has only 49152 bytes.
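For reference, here is a minimal plain-C sketch of that obvious per-pixel approach (the function name, border clamping and nearest-rank quantile selection are just placeholder choices of mine); the window buffer alone is the 132098 bytes that won't fit:

```c
#include <stdint.h>
#include <stdlib.h>

static int cmp_u16(const void *a, const void *b) {
    return (int)*(const uint16_t *)a - (int)*(const uint16_t *)b;
}

/* Reference only: quantiles of one 257x257 window centered at (cx, cy).
 * The window buffer alone is 257*257*2 = 132098 bytes, which is what
 * overflows the ~48 KB of local memory when done per work-item on the GPU. */
void window_quantiles(const uint16_t *img, int width, int height,
                      int cx, int cy, const double *qs, uint16_t *out, int nq) {
    enum { W = 257, R = W / 2, N = W * W };
    uint16_t *buf = malloc(N * sizeof *buf);
    int n = 0;
    for (int dy = -R; dy <= R; ++dy) {
        for (int dx = -R; dx <= R; ++dx) {
            int x = cx + dx, y = cy + dy;
            if (x < 0) x = 0; if (x >= width)  x = width - 1;   /* clamp at borders */
            if (y < 0) y = 0; if (y >= height) y = height - 1;
            buf[n++] = img[y * width + x];
        }
    }
    qsort(buf, n, sizeof *buf, cmp_u16);
    for (int i = 0; i < nq; ++i) {                 /* nearest-rank quantile */
        int k = (int)(qs[i] * (n - 1) + 0.5);
        out[i] = buf[k];
    }
    free(buf);
}
```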
So what would be a good approach to implement such a median filter on the GPU?
A solution for CUDA would also be acceptable. But I guess the underlying problem would be the same.

Related

avx512 strided gather with arbitrary stride

I know that in AVX512 you can perform strided gathers with strides of 1, 2, 4 and 8. But what if I have an arbitrary stride that can be anywhere between 10 and 1000? The stride is known at compile time. I understand that the instruction then won't be the bottleneck; the memory probably will be. Is _mm512_set_ps the most effective way to do this?
"strided gathers with strides of 1,2,4,8"
No, there's no special support for that; maybe you're thinking of ARM/ARM64 NEON vld4 4-way deinterleave?
In x86 you can use 1, 2, 4, or 8 as a scale factor for an index vector for vpgatherdd / vpgatherdps, but if you just want every 2nd element it's better to manually shuffle (e.g. _mm512_permutex2var_ps to grab alternate floats from 2 input vectors), getting many useful elements with one wide load instead of accessing cache once per element.
But in your case, with a minimum stride of 10, at most 2 elements will come from the same 16 x 32-bit 512-bit vector, and with wider strides not even one per vector.
So you can use vpgatherdps with _mm512_add_epi32(idx, _mm512_set1_epi32(16 * stride)) in a loop. Or better, just use a fixed vector of indices and increment the base pointer. You might generate that vector of indices with _mm512_mullo_epi32(_mm512_setr_epi32(0,1,2,3,...,15), _mm512_set1_epi32(stride)). Since a float is 4 bytes wide, use a scale factor of 4 with your gathers.
Even if you need to handle huge arrays, incrementing the pointer instead of the vector elements avoids any need for 64-bit indices, as well as minimizing the number of vector uops. (Valuable when using 512-bit vectors on current CPUs.)
IIRC, Intel's optimization manual has a section about strided loads and the tradeoff between manual gathering and the gather instructions. Gather instructions become relatively better the wider your vectors are (2/clock load throughput but only 1/clock shuffle throughput for most shuffles), so especially for 512-bit vectors it's likely a win to use the gather instructions.
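A hedged sketch of what that loop could look like (the function name is made up, and it assumes the element count is a multiple of 16):

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: gather every stride-th float from src into dst, n elements total
 * (assumed to be a multiple of 16). The index vector is built once; the
 * base pointer is advanced instead of the indices. Compile with -mavx512f. */
void gather_strided(const float *src, float *dst, size_t n, int stride) {
    /* idx = {0, stride, 2*stride, ..., 15*stride} */
    __m512i idx = _mm512_mullo_epi32(
        _mm512_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
        _mm512_set1_epi32(stride));
    for (size_t i = 0; i < n; i += 16) {
        /* scale factor 4 because a float is 4 bytes */
        __m512 v = _mm512_i32gather_ps(idx, src, 4);
        _mm512_storeu_ps(dst + i, v);
        src += (size_t)16 * stride;   /* advance the base pointer, not the indices */
    }
}
```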

Cache blocking brings no improvement for image filter on ARM

I'm experimenting with cache blocking. To do that, I implemented two convolution-based smoothing algorithms, both using a 5x5 Gaussian kernel.
The first algorithm is just a simple double for loop, going from left to right, top to bottom, as shown below.
[Image omitted; source: https://people.engr.ncsu.edu/efg/521/f02/common/lectures/notes/lec9.html]
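Roughly, the first algorithm looks like this (a C reconstruction of what the text describes, not the original code; the 5x5 kernel values shown are the standard binomial approximation and may differ from the post's):

```c
#include <stdint.h>

#define KDIM 5
#define KHALF (KDIM / 2)

/* Assumed normalized 5x5 binomial kernel (divide by 256 when applying);
 * the post's actual kernel was shown as an image and may differ. */
static const float K[KDIM][KDIM] = {
    { 1,  4,  6,  4, 1},
    { 4, 16, 24, 16, 4},
    { 6, 24, 36, 24, 6},
    { 4, 16, 24, 16, 4},
    { 1,  4,  6,  4, 1},
};

/* Algorithm 1: plain double loop over an 8-bit grayscale image. */
void smooth_simple(const uint8_t *in, uint8_t *out, int w, int h) {
    for (int y = KHALF; y < h - KHALF; ++y) {
        for (int x = KHALF; x < w - KHALF; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < KDIM; ++ky)
                for (int kx = 0; kx < KDIM; ++kx)
                    acc += K[ky][kx] * in[(y + ky - KHALF) * w + (x + kx - KHALF)];
            out[y * w + x] = (uint8_t)(acc / 256.0f + 0.5f);
        }
    }
}
```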
In the second algorithm I tried to play with cache blocking by splitting the loops into chunks, which became something like the following. I used a BLOCK size of 512x512.
[Image omitted; source: https://people.engr.ncsu.edu/efg/521/f02/common/lectures/notes/lec9.html]
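And a rough reconstruction of the blocked variant (reusing K, KDIM and KHALF from the sketch above; the exact block-edge clamping is my assumption):

```c
#define BLOCK 512

/* Algorithm 2: same 5x5 smoothing as smooth_simple(), but iterating over
 * BLOCK x BLOCK tiles so the working set stays small. */
void smooth_blocked(const uint8_t *in, uint8_t *out, int w, int h) {
    for (int by = KHALF; by < h - KHALF; by += BLOCK) {
        for (int bx = KHALF; bx < w - KHALF; bx += BLOCK) {
            int yend = by + BLOCK < h - KHALF ? by + BLOCK : h - KHALF;
            int xend = bx + BLOCK < w - KHALF ? bx + BLOCK : w - KHALF;
            for (int y = by; y < yend; ++y) {
                for (int x = bx; x < xend; ++x) {
                    float acc = 0.0f;
                    for (int ky = 0; ky < KDIM; ++ky)
                        for (int kx = 0; kx < KDIM; ++kx)
                            acc += K[ky][kx] * in[(y + ky - KHALF) * w + (x + kx - KHALF)];
                    out[y * w + x] = (uint8_t)(acc / 256.0f + 0.5f);
                }
            }
        }
    }
}
```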
I'm running the code on a Raspberry Pi 3B+, which has a Cortex-A53 with 32KB of L1 and 256KB of L2, I believe. I ran the two algorithms with different image sizes (2048x1536, 6000x4000, 12000x8000 and 16000x12000, all 8-bit grayscale). But for every image size, the run times of the two algorithms were very similar.
My question is: shouldn't the first algorithm suffer access latency that the second avoids, especially for large images (like 12000x8000)? Based on the description of cache blocking in this link, when the first algorithm processes data at the end of an image row, the data from the beginning of that row should already have been evicted from L1. Taking the 12000x8000 image as an example: since we use a 5x5 kernel, 5 rows of data are needed, which is 12000x5 = 60KB, already larger than the 32KB L1. When we start processing a new row, 4 rows of previous data are still needed, but they are likely gone from L1 and have to be re-fetched. The second algorithm shouldn't have this problem because the block size is small. Can anyone tell me what I am missing?
I also profiled the two algorithms with OProfile and got the following counts:
Algorithm 1
  event                count
  L1D_CACHE_REFILL     13,933,254
  PREFETCH_LINEFILL    13,281,559

Algorithm 2
  event                count
  L1D_CACHE_REFILL      9,456,369
  PREFETCH_LINEFILL     8,725,250
So it looks like the first algorithm does have more cache misses than the second, as reflected by the L1D_CACHE_REFILL counts. But it also has a higher prefetch rate, which may be due to the simple access pattern of the loop. So does the usual story about cache blocking simply not take hardware prefetching into account?
Conceptually, you're right that blocking will reduce cache misses by keeping the input window in cache.
I suspect the main reason you're not seeing a speedup is that the cache is prefetching from all 5 input rows. Your performance counters show more prefetch loads in the unblocked implementation. I suspect many textbook examples are out of date, since hardware prefetching has kept getting better; Intel's L2 cache could already detect and prefetch from up to 16 linear streams about 10 years ago, I think.
Assume the filter takes 5 * 5 = 25 cycles per output pixel. On the RPi 3 that is 25 / 1.2 GHz = 20.8 ns. The I/O cost per output pixel is reading a 5-pixel-high column of new input, so the amortized I/O rate is 5 bytes / 20.8 ns = 229 MiB/s, which is much less than the ~2 GiB/s of DRAM bandwidth. So in theory the relatively slow computation, combined with prefetching (I'm not certain how effective it is), means that memory access isn't a bottleneck.
Try increasing the filter height; the cache can only detect and prefetch from a certain number of streams. Or try vectorizing the computation so that memory access becomes the bottleneck.

How to calculate a two-dimensional FFT of a large matrix with CUDA?

I have a large 16K x 16K matrix, and the GPU's global memory is not enough to hold it. How can I compute the two-dimensional FFT of this matrix?
Perhaps oversubscribing with unified memory works with cuFFT?
https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
You can also do the FFTs for the rows and the columns separately, moving the data to and from host memory in between, roughly as sketched below.
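A hedged sketch of the row pass in C with the cuFFT API (the function name, the chunking scheme, and the assumption that the row count is a multiple of rows_per_chunk are mine; a second pass over the columns, e.g. after a host-side transpose, completes the 2D FFT):

```c
#include <cufft.h>
#include <cuda_runtime.h>

/* Row-wise FFT pass of an out-of-core 2D FFT: the n x n single-precision
 * complex matrix lives in host memory, and rows are processed in chunks
 * small enough to fit in GPU memory. Error checking omitted for brevity. */
void fft_rows_chunked(cufftComplex *host_data, int n, int rows_per_chunk) {
    size_t chunk_bytes = (size_t)rows_per_chunk * n * sizeof(cufftComplex);
    cufftComplex *dev_buf;
    cudaMalloc((void **)&dev_buf, chunk_bytes);

    cufftHandle plan;
    /* Batched 1D C2C FFT: one transform of length n per row in the chunk. */
    cufftPlan1d(&plan, n, CUFFT_C2C, rows_per_chunk);

    for (int row = 0; row < n; row += rows_per_chunk) {
        cufftComplex *src = host_data + (size_t)row * n;
        cudaMemcpy(dev_buf, src, chunk_bytes, cudaMemcpyHostToDevice);
        cufftExecC2C(plan, dev_buf, dev_buf, CUFFT_FORWARD);
        cudaMemcpy(src, dev_buf, chunk_bytes, cudaMemcpyDeviceToHost);
    }

    cufftDestroy(plan);
    cudaFree(dev_buf);
}
```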
Do you need the full result matrix? How much memory do you have on CPU and on GPU? Are the inputs/outputs complex values? What precision do you need (is 16 bits enough)? Is the computation time-critical? Do you also want to process even larger matrices?

Why is the execution time of tf.nn.conv2d different when the number of multiplications is the same?

I am using TensorFlow to build a CNN for an image classification experiment, and I noticed the following phenomenon.
Operation 1: tf.nn.conv2d(x, [3,3,32,32], strides=[1,1,1,1], padding='SAME')
The shape of x is [128,128,32]. This convolves x with a 3x3 kernel; both the input and output channel counts are 32. The total number of multiplications is
3*3*32*32*128*128=150994944
Operation 2: tf.nn.conv2d(x, [3,3,64,64], strides=[1,1,1,1], padding='SAME')
The shape of x is [64,64,64]. This convolves x with a 3x3 kernel; both the input and output channel counts are 64. The total number of multiplications is
3*3*64*64*64*64=150994944
Compared with operation 1, the feature map of operation 2 is scaled down by half in each dimension and the channel count is doubled. The number of multiplications is the same, so the running time should be the same. But in practice the running time of operation 1 is longer than that of operation 2.
My measurement method is shown below.
Removing one convolution of operation 1 reduces the training time for one epoch by 23 seconds, which means the running time of operation 1 is 23 seconds.
Removing one convolution of operation 2 reduces the training time for one epoch by 13 seconds, which means the running time of operation 2 is 13 seconds.
The phenomenon reproduces every time.
My GPU is an NVIDIA GTX 980 Ti and the OS is Ubuntu 16.04.
So the question is: why is the running time of operation 1 longer than that of operation 2?
If I had to guess, it has to do with how the image is ordered in memory. Remember that in memory everything is stored in a flattened format. This means that for a tensor of shape [128, 128, 32], the 32 features/channels at each pixel are stored next to each other, then the pixels along a row, then the rows. https://en.wikipedia.org/wiki/Row-major_order
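To make the layout concrete, a tiny illustrative helper (not TensorFlow code):

```c
#include <stddef.h>

/* Illustration only: flattened offset of element (h, w, c) in a row-major
 * [H, W, C] tensor. Channels are adjacent (stride 1), columns are C elements
 * apart, rows are W*C elements apart. */
static inline size_t flat_index(size_t h, size_t w, size_t c, size_t W, size_t C) {
    return (h * W + w) * C + c;
}
/* For [128, 128, 32]: channel stride 1, column stride 32, row stride 128*32. */
```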
Accessing closely packed memory is very important for performance, especially on a GPU, which has a wide memory bus and is optimized for aligned, in-order memory access. In the case with the larger image you have to skip around the image more and the memory accesses are more out of order. In case 2 you can do more in-order memory access, which gives you more speed. Multiplications are very fast operations; I bet that for a convolution, memory access is the bottleneck which limits performance.
chasep255's answer is good and probably correct.
Another possibility (or alternative way of thinking about chasep255's answer) is to consider how caching (all the little hardware tricks that can speed up memory fetches, address mapping, etc) could be producing what you see...
You have basically two things: a stream of X input data and a static filter matrix. In case 1 you have 9*1024 static elements; in case 2 you have 4 times as many. Both cases have the same total multiplication count, but in case 2 the process finds more of its data where it expects it (i.e. where it was the last time it was asked for). Net result: fewer memory access stalls, more speed.

Calculation of gflops for double precision

I have a device whose peak GFLOPS are specified, and I want to measure how far my program is from that peak. Since all the data I use is double precision, should I multiply the number of operations by 2 to get the GFLOPS value and do the comparison?
No. One double-precision floating-point operation is still one floating-point operation.
Most GPUs process double-precision data more slowly than single-precision, so there should be two peak GFLOPS specifications: one for single precision and one for double precision. Sometimes it is broken down further, so that (for example) peak division performance is listed separately from peak addition performance.
" ... , should I multiply the number of ops by 2 to get the GLOPS value and do the comparison?"
No, not for any (but one) of these cards: http://www.geeks3d.com/20140305/amd-radeon-and-nvidia-geforce-fp32-fp64-gflops-table-computing/
Note that the FP64:FP32 ratio varies from 1/24 up to 1/3 in most cases. Also note that the 'Workstation Graphics Card' in that table has a ratio of 1/2; it is specifically designed that way to improve DP performance.
You need to read the specs for the hardware in your card and determine what performance hit to expect from switching from SP to DP. There will be a small amount of additional overhead to load the extra precision into the registers (the memory the hardware operates on) and to retrieve the extra precision after each operation.
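As a worked example with made-up numbers (the peak figure, ratio, op count and timing below are hypothetical, only to show how the comparison works):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical card specs, for illustration only. */
    double sp_peak_gflops = 3000.0;      /* vendor's single-precision peak   */
    double dp_ratio       = 1.0 / 24.0;  /* DP:SP ratio from the card specs  */
    double dp_peak_gflops = sp_peak_gflops * dp_ratio;  /* = 125 GFLOPS      */

    /* Hypothetical measurement of your program. */
    double ops_executed = 1.0e10;        /* double-precision FLOPs performed */
    double elapsed_s    = 0.5;           /* measured wall-clock time         */

    /* One DP operation counts as one FLOP: do NOT multiply by 2. */
    double achieved = ops_executed / elapsed_s / 1.0e9;   /* = 20 GFLOPS */

    printf("achieved: %.1f GFLOPS, DP peak: %.1f GFLOPS (%.1f%% of peak)\n",
           achieved, dp_peak_gflops, 100.0 * achieved / dp_peak_gflops);
    return 0;
}
```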
