I am trying to design a convolution kernel in CUDA. It will operate on relatively small images (typically, for my application, a 19 x 19 image).
In my research, I found most notably this write-up: https://www.evl.uic.edu/sjames/cs525/final.html
I understand the concept behind it, but I wonder: for small images, is it fast enough to use one block per pixel of the original image, with the threads of that block fetching the neighbouring pixels and then performing a block-wide reduction? I made a basic implementation that keeps global memory accesses coalesced. Is this a good design for small pictures, or should I follow the "traditional" method?
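The design I have in mind is roughly the following (a simplified sketch: single channel, filter width fixed at compile time; img, filt, out and FILT_W are placeholder names):

    // One block per output pixel: each thread fetches one neighbourhood element,
    // multiplies it by the matching filter weight, and a block-wide reduction in
    // shared memory produces the output value. Border pixels are clamped.
    #define FILT_W 5                    // filter width (odd); placeholder value
    #define FILT_R (FILT_W / 2)

    __global__ void convOnePixelPerBlock(const float *img, const float *filt,
                                         float *out, int width, int height)
    {
        int ox = blockIdx.x;            // output pixel handled by this block
        int oy = blockIdx.y;
        int fx = threadIdx.x;           // neighbourhood element handled by this thread
        int fy = threadIdx.y;

        int ix = min(max(ox + fx - FILT_R, 0), width  - 1);   // clamp at the borders
        int iy = min(max(oy + fy - FILT_R, 0), height - 1);

        __shared__ float partial[FILT_W * FILT_W];
        int tid = fy * FILT_W + fx;
        partial[tid] = img[iy * width + ix] * filt[fy * FILT_W + fx];
        __syncthreads();

        // Block-wide tree reduction over the partial products
        // (written to cope with a non-power-of-two thread count).
        int count = FILT_W * FILT_W;
        while (count > 1) {
            int half = (count + 1) / 2;
            if (tid < count - half)
                partial[tid] += partial[tid + half];
            __syncthreads();
            count = half;
        }

        if (tid == 0)
            out[oy * width + ox] = partial[0];
    }

It would be launched with dim3 grid(width, height) and dim3 block(FILT_W, FILT_W), i.e. one small block per output pixel.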
It all depends upon your eventual application for your program. If you only intend to convolve a few "relatively small pictures", as you mention, then a naive approach should be sufficient. In fact, a serial approach may even be faster, due to the memory-transfer overhead between the CPU and GPU, if you're not processing much data. I would recommend first writing the kernel that accesses global memory directly, as you mention; if you will be working with larger datasets in the future, it would then make sense to implement the "traditional" approach as well and compare runtimes.
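For reference, the naive global-memory kernel is just one thread per output pixel looping over the filter window; a rough sketch (placeholder names, square filter, clamped borders):

    // Naive version: one thread per output pixel; each thread reads its whole
    // filtW x filtW neighbourhood straight from global memory.
    __global__ void convNaive(const float *img, const float *filt, float *out,
                              int width, int height, int filtW)
    {
        int ox = blockIdx.x * blockDim.x + threadIdx.x;
        int oy = blockIdx.y * blockDim.y + threadIdx.y;
        if (ox >= width || oy >= height)
            return;

        int r = filtW / 2;
        float sum = 0.0f;
        for (int fy = 0; fy < filtW; ++fy) {
            for (int fx = 0; fx < filtW; ++fx) {
                int ix = min(max(ox + fx - r, 0), width  - 1);  // clamp at the borders
                int iy = min(max(oy + fy - r, 0), height - 1);
                sum += img[iy * width + ix] * filt[fy * filtW + fx];
            }
        }
        out[oy * width + ox] = sum;
    }

For a 19 x 19 image that is a single small launch either way, so the kernel launch and the host-device copies will likely dominate the runtime regardless of which design you pick.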
General context:
I have developed a fairly large Navier-Stokes (finite-difference) solver written in Fortran 90. It uses adaptive grids (hence the load-balancing issue), and I have tried various techniques (MPI, OpenMP, and hybrid OpenMP-MPI) to parallelize it. However, it does not scale well enough; according to Amdahl's law, it runs 96-97% of the computations in parallel. Also, the mesh is generally a couple of hundred million points in size, which will need to increase in the future.
Query:
Now, I am thinking of switching to Julia, since it has become very tedious to maintain and add further functionality to the existing code.
The problem is that I am unable to find a good answer about the parallel performance of Julia. I have searched the internet and watched a lot of YouTube videos. What I have noticed is that most people say Julia is well suited to parallel computing; some even provide a bar chart showing the reduction in elapsed time compared to the serial code. However, some of the answers/videos are quite old, which makes them a little unreliable given how quickly this new language is evolving.
Therefore, I would like to know: can the language scale to even a few thousand cores?
Extra information:
I am still trying hard to improve the speedup of my existing code to achieve nearly linear scaling on a couple of thousand cores. The solver needs to exchange overlapping (halo) points 3-4 times per timestep, so it involves a huge communication overhead. However, the non-adaptive-grid version of the code easily scales up to 20k cores.
I have also read somewhere that Julia does not use the InfiniBand standard for parallel data communication.
The following paper has scaling results for PDE-constrained parameter estimation problems, but not up to anywhere near the number of cores you seem to be interested in: https://arxiv.org/abs/1606.07399. I haven't seen any examples going up to thousands of cores.
Re InfiniBand: By default, Julia uses shared memory for communication within a node and TCP/IP across nodes, so InfiniBand is not supported out of the box. However, the language allows for the implementation of custom transports, and I imagine someone will add InfiniBand support at some point, but I couldn't find any implementations with a quick Google search.
I'm asking on behalf of a friend working in numerical astrophysics.
Basically, what he's doing is simulating a cloud of gas. There is a finite number of cells, and the timestep is defined such that gas cannot cross more than one cell per step. Each cell has properties like density and temperature. Each timestep, these (and position) need to be recalculated. I believe it's mainly position that's the issue, as that is affected primarily by the gravitational interactions among the cells, all of which affect each other.
At the moment he's running this on a cluster of ~150 nodes, but I wondered: if it's parallelizable like this, could it be run faster on a few GPUs with CUDA? At the moment it takes him a couple of days to finish a simulation. Since GPUs generally have ~500 cores, it seemed like they could provide a boost.
Maybe I'm totally wrong.
Yes, this sounds like a decent application for a GPU. GPU processing is most effective when it's running the same function over a large data set. If you've already got it running in parallel on a cluster, I'd say write it and test it on a single graphics card, see whether that's an improvement over a single cluster node, and then scale accordingly.
The task you describe is a good fit for the GPU. GPUs have successfully been used to dramatically improve performance in areas such as particle, aerodynamic, and fluid simulations.
Without knowing more details about the simulation, it's impossible to say for sure whether it would gain a performance boost. Broadly speaking, algorithms that are memory bound (that is, relatively few arithmetic operations per memory transaction) tend to benefit most from offloading to the GPU.
For astrophysics simulations specifically, the following link may be of use: http://www.astrogpu.org/
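To give a flavour of the gravity part, a minimal brute-force O(N^2) CUDA sketch looks something like the following (all names are placeholders, blockDim.x must equal TILE, and a real code would typically use a tree or grid method such as Barnes-Hut rather than all-pairs):

    // pos[i] = (x, y, z, mass) for each cell/particle; acc[i] receives the
    // gravitational acceleration (without the G factor). One thread per body;
    // each block stages a tile of bodies in shared memory so every position is
    // read from global memory once per block rather than once per thread.
    #define TILE 256
    #define SOFTENING 1e-9f

    __global__ void gravityStep(const float4 *pos, float3 *acc, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float4 pi = (i < n) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
        float3 a = make_float3(0.f, 0.f, 0.f);

        __shared__ float4 tile[TILE];
        for (int base = 0; base < n; base += TILE) {
            int j = base + threadIdx.x;
            tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
            __syncthreads();

            for (int k = 0; k < TILE && base + k < n; ++k) {
                float dx = tile[k].x - pi.x;
                float dy = tile[k].y - pi.y;
                float dz = tile[k].z - pi.z;
                float r2 = dx * dx + dy * dy + dz * dz + SOFTENING;
                float invR = rsqrtf(r2);
                float s = tile[k].w * invR * invR * invR;   // m_j / r^3
                a.x += dx * s;
                a.y += dy * s;
                a.z += dz * s;
            }
            __syncthreads();
        }
        if (i < n)
            acc[i] = a;   // multiply by G on the host, or fold G into the masses
    }

Launched with one thread per cell (gravityStep<<<(n + TILE - 1) / TILE, TILE>>>(pos, acc, n)), this follows the same tiling pattern as the classic CUDA n-body examples; whether it beats the existing cluster code depends entirely on how large n is and on the rest of the timestep.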
I am working on some signal processing code in SciPy, and am now trying to use a numerical optimizer to tune it. Unfortunately, as these things go, it is turning out to be quite a slow process.
The operations I must perform for this optimization are the following:
Load a large 1-D data file (~120,000 points)
Run the optimizer, which:
Executes a signal-processing operation that does not modify the original data and produces 120,000 new data points.
Examines the difference between the original signal and the new signal using various operations,
One of which includes an FFT-based convolution.
Generates a single "error" value to summarise the result -- this is what should be minimized.
Looks at the error and re-runs the operation with different parameters.
The signal-processing and error functions take under 3 seconds per run, but unfortunately doing that 50,000 times takes much longer. I am experimenting with various more efficient optimisation algorithms, but no matter what, it is going to take thousands of iterations.
I have parallelised a couple of the optimisers I'm trying using CPU threads, which wasn't too difficult since the optimiser can easily perform several scheduled runs at once on separate threads using ThreadPool.map.
But this gives only about a 2x speed-up on my laptop, or maybe 8x on a multicore computer. My question is: is this an application for which I could make use of GPU processing? I have already translated some parts of the code to C, and I could imagine using OpenCL to create a function from an array of parameters to an array of error values, and running it hundreds of times at once. Even if it performs the sequential processing part slowly, getting all the results in one shot would be amazing.
However, my guess is that the memory requirements (loading up a large file and producing a temporary one of equal size to generate every data point) would make it difficult to run the whole algorithm in an OpenCL kernel. I don't have much experience with GPU processing and writing CUDA/OpenCL code, so I don't want to set about learning the ins and outs if there is no hope in making it work.
Any advice?
Do you need to produce all 120,000 new points before analysing the difference? Could you calculate each new point as you go, and then decide for that point whether you are converging?
How big are the points? A $50 graphics card today has 1 GB of memory, which should be plenty for 120K points. I'm not as familiar with OpenCL as with CUDA, but there may also be limits on how much of this is texture memory vs. general memory, etc.
Edit: I'm more familiar with CUDA than OpenCL, but this probably applies to both.
The memory on GPUs is a bit more complex but very flexible. You have texture memory that can be read by the GPU kernel and has some very clever cache features to make access to values in 2-D and 3-D arrays very fast. There is OpenGL memory that you can write to for display, and there is a limited (16-64 KB?) amount of fast on-chip memory shared by each block of threads.
Although transfers from main memory to the GPU are relatively slow (a few GB/s), the internal memory bus on the graphics card is around 20x as fast as this.
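To make the question's "array of parameters in, array of error values out" idea concrete, the GPU-side structure could look roughly like this (CUDA here for concreteness, but the same shape works in OpenCL; processSample and the squared-error metric are hypothetical stand-ins for the real processing and error functions, and the FFT-based convolution step would go through a library such as cuFFT rather than hand-written code):

    // One block per candidate parameter set; threads stride over the ~120,000
    // samples, apply a (placeholder) processing step and accumulate a squared
    // error, then a block reduction collapses it to one error value per block.
    __device__ float processSample(const float *signal, int i, int n,
                                   const float *params)
    {
        // Hypothetical stand-in for the real signal-processing operation,
        // e.g. a simple scale-and-offset of one sample.
        return params[0] * signal[i] + params[1];
    }

    __global__ void evaluateParams(const float *signal, int n,
                                   const float *params, int paramsPerSet,
                                   float *errors)
    {
        extern __shared__ float partial[];               // blockDim.x floats
        const float *p = params + blockIdx.x * paramsPerSet;

        float err = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            float d = processSample(signal, i, n, p) - signal[i];
            err += d * d;
        }
        partial[threadIdx.x] = err;
        __syncthreads();

        // Standard block reduction (assumes blockDim.x is a power of two).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            errors[blockIdx.x] = partial[0];
    }

Launched as evaluateParams<<<numCandidates, 256, 256 * sizeof(float)>>>(signal, n, params, paramsPerSet, errors), every candidate in the population is scored in one kernel call, and only the small errors array comes back to the host.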
I'm writing my own graphics library (yep, it's homework :) and use CUDA to do all rendering and calculations fast.
I have a problem with drawing filled triangles. I wrote it in such a way that one process draws one triangle. It works pretty well when there are a lot of small triangles in the scene, but it kills performance completely when the triangles are big.
My idea is to do two passes. In the first, calculate only a table with information about the scanlines (draw from here to there). This would be a per-triangle calculation, as in the current algorithm. In the second pass, actually draw the scanlines with more than one process per triangle.
But will it be fast enough? Maybe there is some better solution?
You can check this blog: A Software Rendering Pipeline in CUDA. I don't think that's the optimal way to do it, but at least the author shares some useful sources.
Second, read this paper: A Programmable, Parallel Rendering Architecture. I think it's one of the most recent papers, and it's also CUDA-based.
If I had to do this, I would go with a Data-Parallel Rasterization Pipeline like in Larrabee (which is TBR, i.e. tile-based rendering) or even REYES, and adapt it to CUDA:
http://www.ddj.com/architect/217200602
http://home.comcast.net/~tom_forsyth/larrabee/Standford%20Forsyth%20Larrabee%202010.zip (see the second part of the presentation)
http://graphics.stanford.edu/papers/mprast/
I suspect that you have some misconceptions about CUDA and how to use it, especially since you refer to a "process" when, in CUDA terminology, there is no such thing.
For most CUDA applications, there are two important factors in getting good performance: optimizing memory access, and making sure each 'active' CUDA thread in a warp performs the same operation at the same time as the other active threads in the warp. Both of these sound like they are important for your application.
To optimize your memory access, you want to make sure that your reads from global memory and your writes to global memory are coalesced. You can read more about this in the CUDA programming guide, but it essentially means that adjacent threads in a half-warp must read from or write to adjacent memory locations. Also, each thread should read or write 4, 8 or 16 bytes at a time.
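As a tiny illustration (hypothetical copy kernels, nothing to do with your triangles): in the first kernel below, adjacent threads touch adjacent floats, so the accesses coalesce; in the second, they are a stride apart, so they do not.

    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];            // thread k touches element k: coalesced
    }

    __global__ void stridedCopy(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i * stride] = in[i * stride];   // adjacent threads hit addresses
                                                // 'stride' elements apart: not coalesced
    }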
If your memory access pattern is random, then you might need to consider using texture memory. When you need to refer to memory that has been read by other threads in a block, then you should make use of shared memory.
In your case, I'm not sure what your input data is, but you should at least make sure that your writes are coalesced. You will probably have to invest some non-trivial amount of effort to get your reads to work efficiently.
For the second part, I would recommend that each CUDA thread process one pixel in your output image. With this strategy, you should watch out for loops in your kernels that will execute longer or shorter depending on the per-thread data. Each thread in your warps should perform the same number of steps in the same order. The only exception to this is that there is no real performance penalty for having some threads in a warp perform no operation while the remaining threads perform the same operation together.
Thus, I would recommend having each thread check if its pixel is inside a given triangle. If not, it should do nothing. If it is, it should compute the output color for that pixel.
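A minimal sketch of that per-pixel approach (placeholder types and names, one triangle per launch just to show the structure; a real renderer would bin triangles to tiles first):

    // One thread per output pixel: adjacent threads in a warp cover adjacent x
    // coordinates, so the framebuffer writes are coalesced. Each thread tests
    // its pixel against the triangle via edge functions and shades it if inside.
    struct Triangle {
        float2 v0, v1, v2;
        uchar4 color;
    };

    __device__ float edgeFn(float2 a, float2 b, float px, float py)
    {
        // > 0 if (px, py) lies to the left of the directed edge a -> b.
        return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
    }

    __global__ void rasterizeTriangle(Triangle tri, uchar4 *framebuffer,
                                      int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height)
            return;

        float px = x + 0.5f, py = y + 0.5f;        // sample at the pixel centre
        float e0 = edgeFn(tri.v0, tri.v1, px, py);
        float e1 = edgeFn(tri.v1, tri.v2, px, py);
        float e2 = edgeFn(tri.v2, tri.v0, px, py);

        // Inside if the point is on the same side of all three edges
        // (assumes consistent counter-clockwise winding).
        if (e0 >= 0.f && e1 >= 0.f && e2 >= 0.f)
            framebuffer[y * width + x] = tri.color;
        // Threads outside the triangle simply do nothing, which costs little
        // as long as the whole warp takes the same (empty) branch.
    }

Launched with a 2-D grid of, say, 16 x 16 blocks covering the framebuffer, the only divergence is that final if, which is exactly the cheap "some threads do nothing" case described above.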
Also, I'd strongly recommend reading more about CUDA as it seems like you are jumping into the deep end without having a good understanding of some of the basic fundamentals.
Not to be rude, but isn't this what graphics cards are designed to do anyway? Seems like using the standard OpenGL and Direct3D APIs would make more sense.
Why not use the APIs to do your basic rendering, rather than CUDA, which is much lower-level? Then, if you wish to do additional operations that are not supported, you can use CUDA to apply them on top. Or maybe implement them as shaders.
For most of my life, I've programmed CPUs; and although for most algorithms the big-O running time remains the same on CPUs and FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around, whereas FPGA designs are often compute bound).
I would like to learn more about this -- does anyone know of good books / reference papers / tutorials that deal with the following questions:
Which tasks do FPGAs dominate CPUs on (in terms of pure speed)?
Which tasks do FPGAs dominate CPUs on (in terms of work per joule)?
Note: marked community wiki
[no links, just my musings]
FPGAs are essentially interpreters for hardware!
The architecture is like that of dedicated ASICs, but you get rapid development, and in exchange you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA factors of 10/[?], and you'll probably still have a winner. Typical qualities of such tasks:
Massive opportunities for fine-grained parallelism.
(Doing 4 operations at once doesn't count; 128 does.)
Opportunity for deep pipelining.
This is also a kind of parallelism, but it's hard to apply it to a single task, so it helps if you can get many separate tasks to work on in parallel.
(Mostly) fixed data-flow paths.
Some muxes are OK, but massive random accesses are bad, because you can't parallelize them. But see below about memories.
High total bandwidth to many small memories.
FPGAs have hundreds of small (O(1 KB)) internal memories (BlockRAMs in Xilinx parlance), so if you can partition your memory usage into many independent buffers, you can enjoy a data bandwidth that CPUs never dreamed of.
Small external bandwidth (compared to internal work).
The ideal FPGA task has small inputs and outputs but requires a lot of internal work. This way your FPGA won't starve waiting for I/O. (CPUs already suffer from starving, and they alleviate it with very sophisticated (and big) caches, unmatchable in FPGAs.) It's perfectly possible to connect a huge I/O bandwidth to an FPGA (~1000 pins nowadays, some with high-rate SERDESes), but doing that requires a custom board architected for such bandwidth; in most scenarios, your external I/O will be a bottleneck.
Simple enough for HW (aka good SW/HW partitioning).
Many tasks consist of 90% irregular glue logic and only 10% hard work ("kernel" in the DSP sense). If you put all of that onto an FPGA, you'll waste precious area on logic that does no work most of the time. Ideally, you want all the muck to be handled in SW and the HW fully utilized for the kernel. ("Soft-core" CPUs inside FPGAs are a popular way to pack lots of slow irregular logic onto medium area, if you can't offload it to a real CPU.)
Weird bit manipulations are a plus.
Things that don't map well onto traditional CPU instruction sets, such as unaligned access to packed bits, hash functions, coding & compression... However, don't overestimate the factor this gives you: most data formats and algorithms you'll meet have already been designed to go easy on CPU instruction sets, and CPUs keep adding specialized instructions for multimedia.
Lots of floating point, specifically, is a minus, because both CPUs and GPUs crunch it on extremely optimized dedicated silicon. (So-called "DSP" FPGAs also have lots of dedicated mul/add units, but AFAIK these only do integers?)
Low latency / real-time requirements are a plus.
Hardware can really shine under such demands.
EDIT: Several of these conditions — esp. fixed data flows and many separate tasks to work on — also enable bit slicing on CPUs, which somewhat levels the field.
Well, the newest generation of Xilinx parts just announced boasts 4.7 TMACs and general-purpose logic at 600 MHz. (These are basically Virtex-6s fabbed on a smaller process.)
On a beast like this, if you can implement your algorithms in fixed-point operations, primarily multiplies, adds and subtracts, and take advantage of both wide parallelism and pipelined parallelism, you can eat most PCs alive, in terms of both power and processing.
You can do floating point on these, but there will be a performance hit. The DSP blocks contain a 25x18-bit MACC with a 48-bit accumulator. If you can get away with oddball formats and bypass some of the floating-point normalization that normally occurs, you can still eke out a truckload of performance from these (e.g., use the 18-bit input as straight fixed point, or as a float with a 17-bit mantissa instead of the normal 24-bit one). Double-precision floats are going to eat a lot of resources, so if you need those, you will probably do better on a PC.
If your algorithms can be expressed in terms of add and subtract operations, then the general-purpose logic in these parts can be used to implement a gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
If you need division... eh... it's painful, and probably going to be relatively slow, unless you can implement your divides as multiplies.
If you need lots of high-precision trig functions, not so much... Again, it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then you're golden!
Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody who is familiar with all the old math tricks that programmers used on old-school machines like that will find they still apply. All the tricks that modern programmers are told to "let the library do for you" are the kinds of things you need to know to implement maths on these. If you can find a book that talks about doing 3D on a 68000-based Atari or Amiga, it will discuss a lot of how to implement things using integers only.
Actually, any algorithm that can be implemented using lookup tables will be VERY well suited to FPGAs. Not only do you have BlockRAMs distributed throughout the part, but the logic cells themselves can be configured as various-sized LUTs and mini-RAMs.
You can view fixed bit manipulations as FREE! They are simply handled by routing. Fixed shifts or bit reversals cost nothing. Dynamic bit operations, like a shift by a variable amount, will cost a minimal amount of logic and can be done till the cows come home!
The biggest part has 3,960 multipliers! And 142,200 slices, EACH of which can be an 8-bit adder (4 6-bit LUTs per slice, or 8 5-bit LUTs per slice, depending on configuration).
Pick a gnarly SW algorithm. Our company does HW acceleration of SW algorithms for a living.
We've done HW implementations of regular-expression engines that will run thousands of rule-sets in parallel at speeds up to 10 Gb/s. The target market for that is routers, where anti-virus and IPS/IDS can run in real time as the data streams by without slowing down the router.
We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it almost in real time; it takes about 2 seconds of processing to convert 1 second of film. Netflix used our HW almost exclusively for their video-on-demand product.
We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is security video cameras. The government has a massive number of video cameras generating huge streams of real-time data. They zip it down in real time before sending it over their network, and then unzip it in real time on the other end.
Heck, another company I worked for used to do radar receivers using FPGAs. They would sample the digitized enemy radar data directly from several different antennas, and from the time delta of arrival figure out what direction the enemy transmitter is in and how far away it is. Heck, we could even check the unintended modulation on pulse of the signals in the FPGAs to figure out the fingerprint of specific transmitters, so we could know that this signal was coming from a specific Russian SAM site that used to be stationed at a different border, and so track weapons movements and sales.
Try doing that in software!! :-)
For pure speed:
- Parallelizable tasks
- DSP, e.g. video filters
- Moving data, e.g. DMA