I am working on offloading some workload to the GPU using CUDA in a Spring Boot project. To explain my question better, suppose we want to implement a REST API that performs matrix-vector multiplication in a Spring Boot application. We need to load matrices of various sizes into GPU memory at application launch, then accept a user's request containing vector data, find the corresponding matrix on the GPU, perform the matrix-vector multiplication, and finally return the result to the user. We have already implemented the kernel using JCuda.
In this scenario, we want to process users' requests concurrently, so there are several questions I am interested in:
How can we avoid CUDA out-of-memory errors when there are many REST API calls?
If we use explicit CUDA streams to improve the application's throughput, how should we determine the number of streams?
If we also need to perform create/update/delete (CUD) operations on the matrices in GPU memory while processing REST API calls, how can we make those operations and the matrix-vector multiplications atomic with respect to each other?
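To make the intended usage concrete, here is a rough host-side sketch of the pattern I have in mind, written in plain CUDA C++ rather than JCuda (the JCuda bindings mirror these runtime calls). `StreamPool`, `matrix_lock`, and `handle_request` are placeholder names of mine, and the kernel launch is commented out since we already have that part:

```cuda
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <shared_mutex>
#include <vector>

// A fixed-size pool of CUDA streams. Requests block here when all streams are
// busy, which also bounds the amount of per-request device memory in flight.
class StreamPool {
public:
    explicit StreamPool(int n) : streams_(n), available_(n) {
        for (auto& s : streams_) cudaStreamCreate(&s);
    }
    cudaStream_t acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return available_ > 0; });
        return streams_[--available_];
    }
    void release(cudaStream_t s) {
        std::lock_guard<std::mutex> lk(m_);
        streams_[available_++] = s;
        cv_.notify_one();
    }
private:
    std::vector<cudaStream_t> streams_;
    int available_;
    std::mutex m_;
    std::condition_variable cv_;
};

// CUD operations on the stored matrices would take a unique_lock on this;
// multiplications take a shared_lock, so reads can overlap each other but
// never interleave with a matrix update.
std::shared_mutex matrix_lock;

void handle_request(StreamPool& pool, const float* d_matrix, int rows, int cols,
                    const float* h_vec, float* h_out) {
    std::shared_lock<std::shared_mutex> read_guard(matrix_lock);
    cudaStream_t s = pool.acquire();

    float *d_vec = nullptr, *d_out = nullptr;
    cudaMalloc(&d_vec, cols * sizeof(float));
    cudaMalloc(&d_out, rows * sizeof(float));
    cudaMemcpyAsync(d_vec, h_vec, cols * sizeof(float), cudaMemcpyHostToDevice, s);
    // matvec_kernel<<<grid, block, 0, s>>>(d_matrix, d_vec, d_out, rows, cols);
    cudaMemcpyAsync(h_out, d_out, rows * sizeof(float), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);

    cudaFree(d_vec);
    cudaFree(d_out);
    pool.release(s);
}
```

My thinking is that bounding the number of streams with the pool also bounds the per-request device allocations, which is where I expect the out-of-memory risk to come from, but I am not sure this is the right approach.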
I am trying to design a convolution kernel for CUDA. It will take relatively small pictures (typically, for my application, a 19 × 19 image).
In my research, I found most notably this paper: https://www.evl.uic.edu/sjames/cs525/final.html
I understand the concept, but I wonder: for small images, is it fast enough to use one block per pixel of the original image, use the threads of that block to fetch the neighbouring pixels, and then do a block-wide reduction?
I made a basic implementation that keeps global memory accesses coalesced, so is that a good design for small pictures? Or should I follow the "traditional" method?
It all depends on the eventual application for your program. If you intend to convolve only a few "relatively small pictures", as you mention, then a naive approach should be sufficient. In fact, a serial approach may even be faster due to the memory-transfer overhead between the CPU and GPU if you're not processing much data. I would recommend first writing the kernel that accesses global memory, as you mention, and if you will be working with a larger dataset in the future, it would make sense to attempt the "traditional" approach as well and compare runtimes.
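For reference, a minimal sketch of the global-memory kernel I mean (my own example; the row-major layout, clamped borders, and names are assumptions on my part):

```cuda
// Naive 2D convolution: one thread per output pixel, all reads from global memory.
// Assumes row-major storage and an odd-sized K x K filter; borders are clamped.
__global__ void convolve_naive(const float* img, const float* filt, float* out,
                               int width, int height, int K) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int r = K / 2;
    float acc = 0.0f;
    for (int dy = -r; dy <= r; ++dy) {
        for (int dx = -r; dx <= r; ++dx) {
            int ix = min(max(x + dx, 0), width - 1);   // clamp at the image border
            int iy = min(max(y + dy, 0), height - 1);
            acc += img[iy * width + ix] * filt[(dy + r) * K + (dx + r)];
        }
    }
    out[y * width + x] = acc;
}
```

For a 19 × 19 image you can launch this as a single 19 × 19 thread block; for larger images, a 16 × 16 block with a grid covering the image is the usual choice.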
I'm working on an application where I process a video feed in real time on my GPU, and once in a while I need to do some resource-intensive calculations on the GPU besides that. My problem is that I want to keep the video processing at real-time speed while doing the extra work in parallel whenever it comes up.
The way I think this should be done is with two command queues, one for the real-time video processing and one for the extensive calculations. However, I have no idea how this plays out with the GPU's compute resources: will equally many workers be assigned to each command queue during parallel execution (so I could expect a slowdown of about 50% in my real-time computations)? Or is it device-dependent?
The OpenCL specification leaves it up to the vendor to decide how to balance execution resources between multiple command queues. So a vendor could implement OpenCL in such a way that causes the GPU to work on only one kernel at a time. That would be a legal implementation, in my opinion.
If you really want to solve your problem in a device-independent way, I think you need to figure out how to break up your large non-real-time computation into smaller computations.
AMD has some extensions (some of which I think got adopted in OpenCL 1.2) for device fission, which means you can reserve some portion of the device for one context and use the rest for others.
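As a purely conceptual illustration (this is CUDA rather than OpenCL, so it does not answer your question directly), the analogous mechanism there is stream priorities; the same vendor-dependence caveat applies to whether the two streams actually share the GPU concurrently:

```cuda
#include <cuda_runtime.h>

// CUDA (not OpenCL) sketch of giving the real-time work its own, higher-priority
// queue. Whether the two streams truly run concurrently is still hardware and
// driver dependent, just as with multiple OpenCL command queues.
int main() {
    int least, greatest;  // numerically lower value = higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t video_stream, batch_stream;
    cudaStreamCreateWithPriority(&video_stream, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&batch_stream, cudaStreamNonBlocking, least);

    // process_frame_kernel<<<grid1, block1, 0, video_stream>>>(...);  // real-time path
    // heavy_kernel<<<grid2, block2, 0, batch_stream>>>(...);          // occasional heavy work

    cudaStreamDestroy(video_stream);
    cudaStreamDestroy(batch_stream);
    return 0;
}
```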
Intel's Integrated Performance Primitives (IPP) library has a feature called Deferred Mode Image Processing (DMIP). It lets you specify a sequence of functions, composes them, and applies the composed function to an array via cache-friendly tiled processing, which gives better performance than naively iterating through the whole array for each function.
It seems like this technique would benefit code running on a GPU as well. There are many GPU libraries available, such as NVIDIA Performance Primitives (NPP), but none seem to have a feature like DMIP. Am I missing something? Or is there a reason that GPU libraries would not benefit from automated function composition?
GPU programming has concepts similar to DMIP's function composition on the CPU.
Although it is not easy to automate on the GPU (some third-party libraries may be able to do it), doing it manually is easier than on the CPU (see the Thrust example below).
DMIP has two main features:
processing the image in fragments so the data fits into cache;
processing different fragments, or different independent branches of the operation graph, in parallel.
When applying a sequence of basic operations to a large image,
feature 1 avoids RAM reads/writes between the basic operations (all intermediate reads and writes happen in cache), and feature 2 can utilize a multi-core CPU.
The GPGPU counterpart of DMIP feature 1 is kernel fusion. Instead of applying multiple basic-operation kernels to the image data, one combines the basic operations into a single kernel to avoid repeated reads and writes of GPU global memory.
A manual kernel fusion example using Thrust can be found on page 26 of these slides.
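As a minimal illustration of the idea (my own toy example, not the one from the slides): instead of one transform for `a*x` followed by a second transform for `+ b`, both operations are fused into a single functor, so the intermediate result never touches global memory.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Fused "scale then offset": one kernel, one read and one write per element,
// instead of two separate transforms with an intermediate array in global memory.
struct scale_then_offset {
    float a, b;
    scale_then_offset(float a_, float b_) : a(a_), b(b_) {}
    __host__ __device__ float operator()(float x) const { return a * x + b; }
};

int main() {
    thrust::device_vector<float> x(1 << 20, 1.0f);
    thrust::device_vector<float> z(x.size());
    thrust::transform(x.begin(), x.end(), z.begin(), scale_then_offset(2.0f, 3.0f));
    return 0;
}
```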
The library ArrayFire also seems to have made notable efforts on automatic kernel fusion.
The GPGPU counterpart of DMIP feature 2 is concurrent kernel execution.
Running kernels concurrently increases the bandwidth requirement, but most GPGPU programs are already bandwidth-bound, so concurrent kernel execution is not likely to be used very often.
CPU cache vs. GPGPU shared mem/cache
The CPU cache is what avoids the RAM reads/writes in DMIP; in a fused GPGPU kernel, registers play the same role. A CPU thread in DMIP processes a small image fragment, whereas a GPGPU thread often processes only one pixel, so a few registers are enough to buffer a GPU thread's data.
For image processing, GPGPU shared memory/cache is used when a result pixel depends on its surrounding pixels. Image smoothing/filtering is a typical example that requires shared memory/cache.
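A sketch of that pattern (my own example, a simple 3 × 3 box filter, assuming the kernel is launched with a TILE × TILE thread block): each block stages its tile plus a one-pixel halo into shared memory, and every thread then reads its neighbourhood from shared memory instead of global memory.

```cuda
#define TILE 16
#define RADIUS 1  // 3x3 box filter

// Launch with blockDim = (TILE, TILE). Each block loads a (TILE + 2*RADIUS)^2
// tile, including the halo, into shared memory, then averages 3x3 neighbourhoods.
__global__ void box3x3_shared(const float* in, float* out, int width, int height) {
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    const int bx = blockIdx.x * TILE;
    const int by = blockIdx.y * TILE;

    // Cooperative load of the tile plus halo, clamping at the image border.
    for (int ty = threadIdx.y; ty < TILE + 2 * RADIUS; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2 * RADIUS; tx += TILE) {
            int gx = min(max(bx + tx - RADIUS, 0), width - 1);
            int gy = min(max(by + ty - RADIUS, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    __syncthreads();

    const int x = bx + threadIdx.x;
    const int y = by + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            sum += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
    out[y * width + x] = sum / 9.0f;
}
```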
I am working on some signal processing code in SciPy, and am now trying to use a numerical optimizer to tune it. Unfortunately, as these things go, it is turning out to be quite a slow process.
The operations I must perform for this optimization are the following:
Load a large 1-d data file (~ 120000 points)
Run optimizer, which:
Executes a signal processing operation, does not modify original data, produces 120000 new data points.
Examines difference between original signal and new signal using various operations,
One of which includes FFT-based convolution
Generates a single "error" value to summarise the result -- this is what should be minimized
Looks at error and re-runs operation with different parameters
The signal processing and error functions take under 3 seconds, but unfortunately doing it 50,000 times takes much longer. I am experimenting with various more efficient optimisation algorithms, but no matter what it's going to take thousands of iterations.
I have parallelised a couple of the optimisers I'm trying using CPU threads, which wasn't too difficult since the optimiser can easily perform several scheduled runs at once on separate threads using ThreadPool.map.
But this is only about a 2x speed-up on my laptop, or maybe 8x on a multicore computer. My question is: is this an application for which I could make use of GPU processing? I have already translated some parts of the code to C, and I could imagine using OpenCL to create a function from an array of parameters to an array of error values, and running this hundreds of times at once. Even if it performs the sequential processing part slowly, getting all the results in one shot would be amazing.
However, my guess is that the memory requirements (loading up a large file and producing a temporary one of equal size to generate every data point) would make it difficult to run the whole algorithm in an OpenCL kernel. I don't have much experience with GPU processing and writing CUDA/OpenCL code, so I don't want to set about learning the ins and outs if there is no hope in making it work.
Any advice?
Do you need to produce all 120,000 new points before analysing the difference? Could you calculate the new point, then decide for that point if you are converging?
How big are the points? A $50 graphics card today has 1 GB of memory, which should be plenty for 120K points. I'm not as familiar with OpenCL as with CUDA, but there may also be limits on how much of this is texture memory vs. general memory, etc.
edit: I'm more familiar with CUDA than OpenCL, but this probably applies to both.
The memory on GPUs is a bit more complex but very flexible. You have texture memory that the GPU kernel can read, with some very clever cache features that make access to values in 2D and 3D arrays very fast. There is OpenGL memory that you can write to for display, and there is a limited (16-64 KB?) shared memory/cache per thread block.
Although transfers from main memory to the GPU are relatively slow (a few GB/s), the internal memory bus on the graphics card is roughly 20x as fast as this.
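To make the "array of parameters in, array of error values out" idea from the question concrete, here is a rough CUDA sketch of my own. The per-point model is a made-up placeholder; the real signal-processing step (and the FFT-based convolution, which would need something like cuFFT rather than fitting inside this one kernel) would replace it. One block evaluates one parameter set, striding over the 120K samples and reducing a sum of squared errors:

```cuda
#include <cuda_runtime.h>

// One block per candidate parameter set. Each block walks the whole signal,
// accumulates a sum of squared errors in shared memory, and writes one error
// value per parameter set. Launch with a power-of-two block size.
__global__ void evaluate_params(const float* signal, int n,
                                const float* params, int n_params_per_set,
                                float* errors) {
    extern __shared__ float partial[];            // one slot per thread
    const float* p = params + blockIdx.x * n_params_per_set;

    float local = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        // Placeholder "model": in reality this would be the signal-processing
        // operation applied with parameter set p.
        float predicted = p[0] * signal[i] + p[1];
        float diff = predicted - signal[i];
        local += diff * diff;
    }
    partial[threadIdx.x] = local;
    __syncthreads();

    // Standard tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        errors[blockIdx.x] = partial[0];
}

// Launch: one block per parameter set, e.g.
// evaluate_params<<<num_sets, 256, 256 * sizeof(float)>>>(d_signal, 120000,
//                                                         d_params, 2, d_errors);
```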
I have implemented Cholesky factorization for solving large linear systems on the GPU using the ATI Stream SDK. Now I want to exploit the computational power of more GPUs and run this code on multiple GPUs.
Currently I have one machine with one GPU installed, and the Cholesky factorization runs properly.
I want to do it for N machines, each with one GPU installed. So please suggest how I should proceed.
First, you have to be aware that this approach will introduce three levels of latency for any communication between nodes:
GPU memory on machine 1 to main memory on machine 1
Main memory on machine 1 to main memory on machine 2
Main memory on machine 2 to GPU memory on machine 2
A good first step will be to do some back-of-the-envelope calculations to determine whether the speed-up you gain by splitting the problem between multiple machines will outweigh the latency you introduce.
Once you're sure the approach is the one you want to follow, then it's pretty much up to you to implement this correctly. Note that, currently, NVIDIA's CUDA or OpenCL libraries will be better choices for you because they allow you to access the GPU for computation without having it coupled with an X session. Once ATI's OpenCL implementation supports the GPU, then this should also be a viable option.
Since you already have a working GPU implementation, here are the basic steps you must follow:
Determine how you update your factorization algorithm to support processing by separate nodes
Set up the data exchange between N computers (I notice you have opted for MPI for this)
Set up the scatter operation that will divide the input problem amongst the computational nodes
Set up the data exchange between a machine and its GPU
Set up the gather operation that will collect the results from the nodes back onto one node
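As a bare-bones skeleton of steps 2-5 (MPI plus CUDA-style memory calls, since as noted above I'd steer you toward CUDA or OpenCL; `local_cholesky_step` and the block layout are placeholders, and a real Cholesky needs repeated exchanges between panel steps, which is what step 1 has to work out):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Scatter blocks of the matrix to the nodes, push each node's block to its GPU,
// run the local factorization step, and gather the results back to rank 0.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int block_elems = 1024 * 1024;           // placeholder block size
    std::vector<float> full_matrix;                // only meaningful on rank 0
    if (rank == 0) full_matrix.resize(static_cast<size_t>(block_elems) * size);

    // Step 3: divide the input problem amongst the computational nodes.
    std::vector<float> my_block(block_elems);
    MPI_Scatter(full_matrix.data(), block_elems, MPI_FLOAT,
                my_block.data(), block_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // Step 4: move this node's block to its GPU and back.
    float* d_block = nullptr;
    cudaMalloc(&d_block, block_elems * sizeof(float));
    cudaMemcpy(d_block, my_block.data(), block_elems * sizeof(float),
               cudaMemcpyHostToDevice);
    // local_cholesky_step(d_block, ...);          // existing single-GPU code
    cudaMemcpy(my_block.data(), d_block, block_elems * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_block);

    // Step 5: gather the partial results back onto rank 0.
    MPI_Gather(my_block.data(), block_elems, MPI_FLOAT,
               full_matrix.data(), block_elems, MPI_FLOAT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```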
It's a very specialised question. I suggest you check the Stream developer resources and the Stream Developer Forums.
I showed this Q to a colleague of mine who knows about these things.
He suggested you use ScaLAPACK.