Let's say I have a device with 2880 CUDA cores.
I want to run a Monte Carlo simulation where:
2000 threads are each running a sample
880 threads are generating random numbers
This is because:
I only want 2000 samples, so the other 880 threads would otherwise be sitting idle
I know that generating random numbers can be slow
Therefore I want to make a pool of random numbers that is replenished continuously by the 880 threads, from which the 2000 sample threads can draw when required.
Is this possible? If so, please provide an example.
Strictly speaking, what you propose does not even seem to be possible in CUDA and, as others point out, it's surely not a good idea. You may want to pick up a book or an online course first to familiarize yourself with GPU programming concepts.
More to the point, if you want to dive straight in, here's an MC pi example solved with CUDA, OpenACC and Thrust.
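For a flavour of what such an example looks like, here is a minimal CUDA sketch (not the linked code; the kernel name, seed and launch parameters are illustrative) that estimates pi with one sample per thread. Note that each sampling thread simply generates its own random numbers with cuRAND, which is the usual pattern, rather than keeping a separate pool of producer threads:

#include <cstdio>
#include <curand_kernel.h>

// Each thread draws one (x, y) point and records whether it falls inside the unit circle.
__global__ void mc_pi(unsigned long long seed, int *hits, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState state;
    curand_init(seed, i, 0, &state);              // one independent RNG stream per thread
    float x = curand_uniform(&state);
    float y = curand_uniform(&state);
    hits[i] = (x * x + y * y <= 1.0f) ? 1 : 0;
}

int main()
{
    const int n = 1 << 20;                        // number of samples (illustrative)
    int *d_hits = 0;
    int *h_hits = new int[n];
    cudaMalloc(&d_hits, n * sizeof(int));

    mc_pi<<<(n + 255) / 256, 256>>>(1234ULL, d_hits, n);
    cudaMemcpy(h_hits, d_hits, n * sizeof(int), cudaMemcpyDeviceToHost);

    long long inside = 0;
    for (int i = 0; i < n; ++i) inside += h_hits[i];
    std::printf("pi ~= %f\n", 4.0 * inside / n);

    cudaFree(d_hits);
    delete[] h_hits;
    return 0;
}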
Please forgive me if this question is too basic. I am neither familiar with the idea of parallelization nor have I used an HPC system before.
I am training a deep learning model which takes a really long time on my PC. It takes approximately 2 days on my i5 with 12 GB RAM.
So I decided to use HPC, but in one of the tutorials I watched, it says that if I do not write my code properly, HPC will not be any faster than a regular PC. What is really meant by that? Should I adjust my original code so that I can benefit from HPC?
Secondly, can we say that using 30 cores should be 5 times faster than using 6 cores? Are speed and number of cores proportional?
Q : "can we say that using 30 cores should be 5 times faster than using 6 cores?"
No, we can not.
Q : "Is speed and number of cores proportionate?"
No, it is not.
There is an ultimate ceiling for any (potential) speedup: Amdahl's Law (even in its original, overhead-naive formulation, which ignores the atomicity of work).
Better to use the revised, overhead-strict, resources-aware re-formulation of Amdahl's Law.
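For reference, the classical formulation bounds the speedup S(N) on N cores when only a fraction p of the work can run in parallel:

$$ S(N) = \frac{1}{(1 - p) + \frac{p}{N}} $$

With, say, p = 0.90 this gives S(6) = 4.0 but S(30) ≈ 7.7, nowhere near a 5x ratio between the two. The overhead-strict re-formulation mentioned above, roughly speaking, also adds the (never zero) setup and termination overhead costs to the denominator, which lowers the ceiling even further.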
There you can see the ceiling for yourself.
Seeking to improve performance?
Start there, ideally spending some time tuning the core parameters in the INTERACTIVE TOOL (URL there).
Converting a classical library (like TF or others) into an HPC-efficient tool is not easy and does not come free - the add-on overhead costs may easily (ref. the results in the INTERACTIVE TOOL) devastate any potential HPC power, simply due to poor scaling (going from costs in the range of a few ns to costs above a few ms kills the game at whatever HPC budget you may spend, doesn't it?).
Yes, that's true. If your code takes a very long time, even an HPC system won't be enough to run it fast. I mean that you benefit from HPC performance when the code is hard to run on a regular PC, for example due to a slow processor, limited RAM, or other limited resources, etc.
But if your code is close to a non-polynomial problem (with very high time complexity), then even an HPC system won't be enough for it. It will make a difference, but not the one you want: for example, code with very high time complexity that takes a regular computer 2 months to execute might still take an HPC system 1 month.
I'm looking into the feasibility of GPU synthesized audio, where each thread renders a sample. This puts some interesting restrictions on what algorithms can be used - any algorithm that refers to a previous set of samples cannot be implemented in this fashion.
Filtering is one of those algorithms. Bandpass, lowpass, or highpass - all of them require looking to the last few samples generated in order to compute the result. This can't be done because those samples haven't been generated yet.
This makes synthesizing bandlimited waveforms difficult. One approach is additive synthesis of partials using the Fourier series. However, this runs in O(n) time, and is especially slow on a GPU, to the point that the gain from parallelism is lost. If there were an algorithm that ran in O(1) time, this would eliminate branching AND be up to 1000x faster when dealing with the audible range.
I'm specifically looking for something like a DSF for a sawtooth. I've been trying to work out a simplification of the Fourier series by hand, but that's really, really hard. Mainly because it involves harmonic numbers, AKA the only singularity of the Riemann zeta function.
Is a constant-time algorithm achievable? If not, can it be proven that it isn't?
Filtering is one of those algorithms. Bandpass, lowpass, or highpass - all of them require looking to the last few samples generated in order to compute the result. This can't be done because those samples haven't been generated yet.
That's not right. IIR filters do need previous results, but FIR filters only need previous input; that is pretty typical for the things that GPUs were designed to do, so it's not likely a problem to let every processing core access let's say 64 input samples to produce one output sample -- in fact, the cache architectures that Nvidia and AMD use lend themselves to that.
Is a constant-time algorithm achievable? If not, can it be proven that it isn't?
It is! In two aspects:
as mentioned above, FIR filters only need multiple samples of immutable input, so they can be parallelized heavily without problems (see the sketch after this list), and
even if you need to calculate your input first, and would like to parallelize that (I don't see a reason for that -- generating a sawtooth is not CPU-limited, but memory bandwidth limited), every core could simply calculate the last N samples -- sure, there's N-1 redundant operations, but as long as your number of cores is much bigger than your N, you will still be faster, and every core will have constant run time.
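As a hedged sketch of the first point, here is a CUDA kernel where each thread produces one output sample from the 64 most recent input samples; the tap count, the names and the requirement that the input buffer carries NTAPS-1 leading "history" samples are all illustrative assumptions:

// One thread per output sample; each thread reads only immutable input,
// so there are no dependencies between threads.
#define NTAPS 64

__global__ void fir(const float *in,    // input, padded with NTAPS-1 leading "history" samples
                    const float *taps,  // filter coefficients
                    float *out,
                    int n_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_out) return;

    float acc = 0.0f;
    for (int k = 0; k < NTAPS; ++k)
        acc += taps[k] * in[i + NTAPS - 1 - k];   // in[i + NTAPS - 1] is the "current" sample
    out[i] = acc;
}

Launched as, e.g., fir<<<(n_out + 255) / 256, 256>>>(d_in, d_taps, d_out, n_out); a per-thread loop over a whole chunk of output samples, as discussed further down, is an equally valid layout.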
Comments on your approach:
I'm looking into the feasibility of GPU synthesized audio, where each thread renders a sample.
From a higher-up perspective, that sounds too fine-grained. I mean, let's say you have 3000 stream processors (high-end consumer GPU). Assuming you have a sampling rate of 44.1kHz, and assuming each of these processors does only one sample, letting them all run once gives you only 3000/44100 ≈ 1/14.7 of a second of audio (mono). Then you'd have to move on to the next part of the audio.
In other words: there are bound to be many, many more samples than processors. In these situations, it's typically way more efficient to let one processor handle a sequence of samples; for example, if you want to generate 30s of audio, that'd be 1.323 MSamples. Simply split the problem into 3000 chunks, one for each processor, and give each of them the 44100*30/3000 = 441 samples it should process plus 64 samples of "history" before the first of its "own" samples; that will still easily fit into local memory.
Yet another thought:
I'm coming from a software defined radio background, where there are usually millions of samples per second, rather than a few kHz of sampling rate, in real time (i.e. processing speed > sampling rate). Still, doing computation on the GPU only pays off for the more CPU-intensive tasks, because there's significant overhead in exchanging data with the GPU, and CPUs nowadays are blazingly fast. So, for your relatively simple problem, it might never be faster to do things on the GPU than to optimize them on the CPU; things of course look different if you've got to process lots of samples, or a lot of streams, at once. For finer-grained tasks, the problem of filling a buffer, moving it to the GPU, and getting the result buffer back into your software usually kills the advantage.
Hence, I'd like to challenge you: Download the GNU Radio live DVD, burn it to a DVD or write it to a USB stick (you might as well run it in a VM, but that of course reduces performance if you don't know how to optimize your virtualizer; really - try it from a live medium), run
volk_profile
to let the VOLK library test which algorithms work best on your specific machine, and then launch
gnuradio-companion
And then open the following two signal processing flow graphs:
"classical FIR": This single-threaded implementation of the FIR filter yields about 50MSamples/s on my CPU.
FIR Filter implemented with the FFT, running on 4 threads: This implementation reaches 160MSamples/s (!!) on my CPU alone.
Sure, with the help of FFTs on my GPU, I could be faster, but the thing here is: Even with the "simple" FIR filter, I can, with a single CPU core, get 50 Megasamples out of my machine -- meaning that, with a 44.1kHz audio sampling rate, per single second I can process roughly 19 minutes of audio. No copying in and out of host RAM. No GPU cooler spinning up. It might really not be worth optimizing further. And if you optimize and take the FFT-Filter approach: 160MS/s means roughly one hour of audio per processing second, including sawtooth generation.
My computer has both an Intel GPU and an NVIDIA GPU. The latter is much more powerful and is my preferred device when performing heavy tasks. I need a way to programmatically determine which one of the devices to use.
I'm aware of the fact that it is hard to know which device is best suited for a particular task. What I need is to (programmatically) make a qualified guess using the variables listed below.
How would you rank these two devices? Intel HD Graphics 4400 to the left, GeForce GT 750M to the right.
GlobalMemoryCacheLineSize 64 vs 128
GlobalMemoryCacheSize 2097152 vs 32768
GlobalMemorySize 1837105152 vs 4294967296
HostUnifiedMemory true vs false
Image2DMaxHeight 16384 vs 32768
Image2DMaxWidth 16384 vs 32768
Image3DMaxDepth 2048 vs 4096
Image3DMaxHeight 2048 vs 4096
Image3DMaxWidth 2048 vs 4096
LocalMemorySize 65536 vs 49152
MaxClockFrequency 400 vs 1085
MaxComputeUnits 20 vs 2
MaxConstantArguments 8 vs 9
MaxMemoryAllocationSize 459276288 vs 1073741824
MaxParameterSize 1024 vs 4352
MaxReadImageArguments 128 vs 256
MaxSamplers 16 vs 32
MaxWorkGroupSize 512 vs 1024
MaxWorkItemSizes [512, 512, 512] vs [1024, 1024, 64]
MaxWriteImageArguments 8 vs 16
MemoryBaseAddressAlignment 1024 vs 4096
OpenCLCVersion 1.2 vs 1.1
ProfilingTimerResolution 80 vs 1000
VendorId 32902 vs 4318
Obviously, there are hundreds of other devices to consider. I need a general formula!
You cannot have a simple formula to calculate an index from those parameters.
Explanation
First of all, let me assume you can trust the collected data; of course, if you read 2 for MaxComputeUnits but in reality it's 80, then there is nothing you can do (unless you have your own database of cards with all their specifications).
How can you guess if you do not know the task you have to perform? It may be something highly parallel (then more units may be better) or raw brute calculation (then a higher clock frequency or a bigger cache may be better). As with a normal CPU, the number of threads isn't the only factor you have to consider for parallel tasks. Just to mention a few things you have to consider:
Cache: how much local data each task works with?
Memory: shared with CPU? How many concurrent accesses compared to parallel tasks?
Instruction set: do you need something specific that increases speed even if other parameters aren't so good?
Misc stuff: do you have some specific requirement, for example size of something that must be supported and a fallback method makes everything terribly slow?
To make it short: you cannot calculate an index in a reliable way because there are too many factors and they're strongly correlated (for example, high parallelism may be slowed down by a small cache or slow memory access, but a specific instruction, if supported, may give you great performance even if all other parameters are poor).
One Possible Solution
If you need a raw comparison you may simply compute MaxComputeUnits * MaxClockFrequency (and it may even be enough for many applications), but if you need a more accurate index then don't think it'll be an easy task or that you'll get a general-purpose formula like (a + b / 2)^2; it's not, and the results will be very specific to the task you have to accomplish.
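A minimal sketch of that raw comparison, assuming you only want to rank the GPU devices that OpenCL reports (the score is exactly the crude MaxComputeUnits * MaxClockFrequency product, nothing more):

#include <CL/cl.h>
#include <cstdio>

int main()
{
    cl_platform_id platforms[8];
    cl_uint n_platforms = 0;
    clGetPlatformIDs(8, platforms, &n_platforms);

    for (cl_uint p = 0; p < n_platforms; ++p) {
        cl_device_id devices[8];
        cl_uint n_devices = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 8, devices, &n_devices) != CL_SUCCESS)
            continue;

        for (cl_uint d = 0; d < n_devices; ++d) {
            cl_uint units = 0, mhz = 0;
            char name[256] = {0};
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(mhz), &mhz, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            std::printf("%s: score = %u\n", name, units * mhz);   // crude: units * clock (MHz)
        }
    }
    return 0;
}

Note that, with the numbers listed in the question, this product would actually favour the Intel part (20 * 400 = 8000 versus 2 * 1085 = 2170), precisely because a "compute unit" represents a very different amount of hardware for each vendor; that is exactly the sense in which the score is crude.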
Write a small test (as similar as possible to your actual task; take a look at this post on SO) and run it with many cards; with a big enough statistical sample you may extrapolate an index from an unknown set of parameters. The algorithms can become pretty complex and there is a vast literature about this topic, so I won't even try to repeat it here. I would start with the Wikipedia article as a summary pointing to other, more specific papers. If you need an example of what you have to do, you may read Exploring the Multiple-GPU Design Space.
Remember that the more variables you add to your study, the more unstable the quality of the results will be, and the fewer parameters you use, the less accurate the results will be. To better support extrapolation:
After you have collected enough data you should first select and reduce the variables with some pre-analysis to a subset including only those that influence your benchmark results the most (for example, MaxGroupSize may not be so relevant). This phase is really important and decisions should be made with statistical tools (you may, for example, calculate p-values).
Some parameters may have great variability (memory size, number of units), but analysis is easier with fewer values (for example [0..5) units, [5..10) units, [10..*) units). You should then partition the data (watching its distribution). Different partitions may lead to very different results, so you should try different combinations.
There are many other things to consider; a good book about data mining would help you more than a thousand words written here.
As Adriano has pointed out, there are many things to take into consideration... too many things.
But I can think of a few (easier) things that could be done to help you out (not to completely solve your problem):
OCL Version
First things first: which version of OCL do you need? (Not really related to performance.) But if you use some feature of OCL 1.2... well, problem solved, since only one of your two devices reports OCL 1.2.
Memory or computation bound
You can usually (and crudely) place your algorithm in one of these two categories: memory-bound or computation-bound. If it's memory-bound (with a lot of transfers between host and device), probably the most interesting info is which device has Host Unified Memory. If not, the device with the most powerful processors would most probably be more interesting.
Rough benchmark
But most probably it won't be that easy to decide which category your application falls into.
In that case you could make a small benchmark. Roughly, this benchmark would test different sizes of data (if your app has to deal with that) on dummy computations that more or less match the amount of computation your application requires (estimated by you after you have completed the development of your kernels). You could log the point where the amount of data is so big that it cancels out the advantage of the more powerful device that is connected via PCIe.
GPU Occupancy
Another very important thing when programming on GPUs is GPU occupancy. The higher, the better. NVIDIA provides an Excel file that calculates the occupancy based on some inputs. Based on these concepts, you could more or less reproduce the calculation of the occupancy (some adjustment will most probably be needed for other vendors) for both GPUs and choose the one with the highest.
Of course, you need to know the values of these inputs. Some of them are based on your code, so you can calculate them beforehand. Some of them are linked to the specs of the GPU. You can query some of them as you already did; for some others you might need to hardcode the values in some files after some googling (but at least you don't need to have these GPUs at hand to test on them). Last but not least, don't forget that OpenCL provides clGetKernelWorkGroupInfo(), which can give you info such as the amount of local or private memory needed by a specific kernel.
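A hedged sketch of those queries; kernel and device are assumed to be the cl_kernel and cl_device_id you have already created elsewhere:

#include <CL/cl.h>
#include <cstdio>

void print_kernel_requirements(cl_kernel kernel, cl_device_id device)
{
    size_t max_wg_size = 0;
    cl_ulong local_mem = 0, private_mem = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg_size), &max_wg_size, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);

    std::printf("max work-group size: %zu\n", max_wg_size);
    std::printf("local memory:        %llu bytes\n", (unsigned long long)local_mem);
    std::printf("private memory:      %llu bytes\n", (unsigned long long)private_mem);
}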
Regarding the info about the local memory, please note this remark from the standard:
If the local memory size, for any pointer argument to the kernel declared with the __local address qualifier, is not specified, its size is assumed to be 0.
So it means that this info could be useless if you first have to compute the size dynamically on the host side. A work-around for that could be to use the fact that the kernels are JIT-compiled. The idea here is to use the preprocessor option -D when calling clBuildProgram(), as I explained here. This would give you something like:
// SIZE is supplied at build time, e.g. via the build option "-DSIZE=1024"
__kernel void mykernel(__global float *args)
{
    __local float myLocalMem[SIZE];   // element type is illustrative
    ....
}
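On the host side the option string can be assembled at runtime; a minimal sketch, where program and device already exist and compute_local_elems() is a placeholder for your own sizing logic:

#include <CL/cl.h>
#include <cstdio>

size_t compute_local_elems();   // placeholder: whatever host-side logic decides the __local size

void build_with_size(cl_program program, cl_device_id device)
{
    char options[64];
    std::snprintf(options, sizeof(options), "-DSIZE=%zu", compute_local_elems());
    clBuildProgram(program, 1, &device, options, NULL, NULL);
}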
And what if the easiest option was:
After all the blabla, I'm guessing that you worry about this because you might want to ship your application to users without knowing what hardware they have. Would it be very inconvenient (at install time, or maybe later by providing them a command or a button) to simply run your application with dummy generated data to measure which device performs better, and simply log it in a config file?
Or maybe:
Sometimes, depending on your specific problem (one that doesn't involve too many syncs), you don't have to choose. Sometimes you could simply split the work between the two devices and use both...
Why guess? Choose dynamically on your hardware of the day: Take the code you wish to run on the "best" GPU and run it, on a small amount of sample data, on each available GPU. Whichever finishes first: use it for the rest of your calculations.
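A minimal sketch of that idea, where run_sample_workload() is a hypothetical helper that builds your real kernel for the given device, runs it on a small representative data set, and returns only after clFinish():

#include <CL/cl.h>
#include <chrono>
#include <vector>

void run_sample_workload(cl_device_id device);   // hypothetical: your kernel on sample data

cl_device_id pick_fastest(const std::vector<cl_device_id>& devices)
{
    cl_device_id best = devices.front();
    double best_ms = 1e300;
    for (cl_device_id dev : devices) {
        auto t0 = std::chrono::steady_clock::now();
        run_sample_workload(dev);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best_ms) { best_ms = ms; best = dev; }
    }
    return best;   // use this device for the rest of the calculations
}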
I'm loving all of the solutions so far. If it is important to make the best device selection automatically, that's how to do it (weight the values based on your usage needs and take the highest score).
Alternatively, and much simpler, is to just take the first GPU device, but also have a way for the user to see the list of compatible devices and change it (either right away or on the next run).
This alternative is reasonable because most systems only have one GPU.
I'm designing a system that will be on-line in 2016 and run on commodity 1U or 2U server boxes. I'd like to understand how parallel the software will need to be, so I'd like to estimate the number of cores per physical machine. I'm not interested in more exotic hardware like video game console processors, GPUs or DSPs. I could extrapolate based on when chips were issued by Intel or AMD, but this historical information seems scarce.
Thanks.
I found the following charts from Design for Manycore Systems:
As the great computer scientist Yogi Berra said, "It's tough to make predictions, especially about the future." Given the relative recency of multicore systems, I think you're right to be wary of extrapolations. Still, you need a number to aim for.
M. Spinelli's graphs are very valuable, and (I think) have the benefit of being based on real plans out to 2014. Other than that, if you want a simple, easily calculable and defensible number, I'd take as a starting point the number of cores in current (say) 2U systems at your price point (high-range systems -- 24-32 cores at $15k; mid-range, 12-16 cores at $8k; lower-end, 8-12 cores at $5k). Then note that Moore's law suggests 8-16x as many transistors per unit silicon in 2016 as now, and that on current trends, these mainly go into more cores. That suggests 64-512 cores per node depending on how much you're spending on each -- and these numbers are consistent with the graphs Matt Spinelli posted above.
Cores per physical machine doesn't seem to be a particularly good metric, I think. We haven't really seen that number grow in particularly non-linear ways, and many-core hardware has been available COTS since the 90's (though it was relatively specialized at that point). If your task is really that parallel, quadrupling the number of cores shouldn't change it that much. We've always had the option of faster-but-fewer-cores, which should still be available to you in 6 years if you find that you don't scale well with the current number of cores.
If your application is really embarrassingly parallel, why are you unwilling to consider GPU solutions?
How quickly do you plan to rotate the hardware? Leave old machines till they die, or replace them proactively as they start to slow the cluster down? How many machines are we talking about? What kind of interconnect technology are you considering? For many cluster applications that is the limiting factor.
The drdobbs article above is not a bad analysis, but I think it misses the point just a tad. It's going to be a good while before many mainstream apps can take advantage of really parallel general compute hardware (and many tasks simply can't be parallelized much), and when they do, they'll be using graphics cards and (to a lesser extent) soundcards as the specialized hardware they use to do it.
I have implemented Cholesky factorization for solving large linear systems on a GPU using the ATI Stream SDK. Now I want to exploit the computational power of more GPUs and run this code on multiple GPUs.
Currently I have one machine with one GPU installed, and the Cholesky factorization runs properly.
I want to do this for N machines, each with one GPU installed. So please suggest how I should proceed.
First, you have to be aware that this approach will introduce three levels of latency for any communication between nodes:
GPU memory on machine 1 to main memory on machine 1
Main memory on machine 1 to main memory on machine 2
Main memory on machine 2 to GPU memory on machine 2
A good first step will be to do some back-of-the-envelope calculations to determine whether the speedup you gain by splitting the problem between multiple machines will outweigh the latency you introduce.
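As an illustration of the kind of envelope math meant here (the bandwidth figures are round-number assumptions, not measurements of any particular cluster): shipping a 1 GB block over gigabit Ethernet versus over a PCIe 2.0 x16 link costs roughly

$$ t_{\text{net}} \approx \frac{1\ \text{GB}}{0.1\ \text{GB/s}} = 10\ \text{s}, \qquad t_{\text{PCIe}} \approx \frac{1\ \text{GB}}{8\ \text{GB/s}} \approx 0.13\ \text{s} $$

so the network hop dominates, and the GPU time saved per node has to exceed the sum of all three transfers before the split pays off.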
Once you're sure the approach is the one you want to follow, then it's pretty much up to you to implement this correctly. Note that, currently, NVIDIA's CUDA or OpenCL libraries will be better choices for you because they allow you to access the GPU for computation without having it coupled with an X session. Once ATI's OpenCL implementation supports the GPU, then this should also be a viable option.
Since you already have a working GPU implementation, here are the basic steps you must follow:
Determine how to update your factorization algorithm to support processing by separate nodes
Set up the data exchange between N computers (I notice you have opted for MPI for this)
Set up the scatter operation that will divide the input problem amongst the computational nodes (see the sketch after this list)
Set up the data exchange between a machine and its GPU
Set up the gather operation that will gather the results from the nodes into the one node
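A hedged C++ skeleton of steps 3-5 only (the factorization-specific communication from step 1 is deliberately not shown); solve_block_on_gpu() is a purely illustrative stand-in for your existing ATI Stream / OpenCL launch code, and the row-block decomposition is an assumption:

#include <mpi.h>
#include <vector>
#include <cstddef>

void solve_block_on_gpu(double *block, std::size_t n_elems);   // hypothetical wrapper around your GPU code

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 4096;                         // illustrative matrix dimension, assumed divisible by nprocs
    const int block_elems = (n / nprocs) * n;   // one block of rows per node

    std::vector<double> full;                   // the root node holds the whole matrix
    if (rank == 0) full.resize(static_cast<std::size_t>(n) * n);
    std::vector<double> my_rows(block_elems);

    // Step 3: scatter one block of rows to every node.
    MPI_Scatter(full.data(), block_elems, MPI_DOUBLE,
                my_rows.data(), block_elems, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // Step 4: each node moves its block to its local GPU and processes it there.
    solve_block_on_gpu(my_rows.data(), my_rows.size());

    // Step 5: gather the partial results back on the root node.
    MPI_Gather(my_rows.data(), block_elems, MPI_DOUBLE,
               full.data(), block_elems, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}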
It's a very specialised question. I suggest you check the Stream developer resources and the Stream Developer Forums.
I showed this Q to a colleague of mine who knows about these things.
He suggested you use ScaLAPACK.