Is this simulation appropriate for CUDA or OpenCL? - parallel-processing

I'm asking on behalf of a friend working in numerical astrophysics.
Basically what he's doing is simulating a cloud of gas. There are a finite number of cells and the timestep is defined such that gas cannot cross more than one cell each step. Each cell has properties like density and temperature. Each timestep, these (and position) need to be calculated. It's mainly position that's the issue I believe as that is affected primarily by the interactions of gravity among the cells, all of which affect each other.
At the moment he's running this on a cluster of ~150 nodes but I wondered, if it's parallelizable like this, could it be run faster on a few GPUs with CUDA? At the moment it takes him a couple of days to finish a simulation. As GPUs generally have ~500 cores, it seemed like they could provide a boost.
Maybe I'm totally wrong.

Yes this sounds like a decent application for a GPU. GPU processing is most effective when it's running the same function on a large data set. If you've already got it running in parallel on a cluster computer, I'd say write it and test it on a single graphics card, and see if that's an improvement on a single cluster, then scale accordingly.

The task you describe is a good fit for the GPU. GPUs have successfully been used for dramatically improving the performance in areas such as particle, aerodynamics and fluid simulations.

Without knowing more details about the simulation it's impossible to say for sure whether it would gain a performance boost. Broadly speaking, algorithms that are memory bound ( that is, relatively few arithmetic operations per memory transaction ) tend to benefit most from offloading to the GPU.
For astrophysics simulations specifically, the following link may be of use : http://www.astrogpu.org/

Related

How to allocate more cpu usage to a program

I am running only one program on my computer to crunch numbers, and it takes up about 25% CPU (all other built-in applications are less than 4% CPU). Since this is the only program I am running, how do I raise the CPU percentage from 25% to 40%? I know changing the priority doesn't really help that much, or the affinity. I am using Windows 10. Thanks for help!
Distributing a demanding computational task (aka number crunching) among multiple processors or cores is usually not a trivial challenge. Likelihood of success depends upon how easy it is to divide the problem into sub-problems, each of which either doesn't need to communicate with each other or need a sufficiently small amount of communication such that communication overhead does not spoil all the speed gain you theoretically could get by using multiple processors.
Being as it is, this is usually a case-by-case decision. If you are lucky there is a special library for your problem domain for you ready to use.
Examples of problems that lend themselves to parallelization (quite) well are
video encoding (different time sections are practically independent of each other and can be encoded separately)
fractals like Mandelbrot (any area of the fractal is completely independent of the others)
explicit structural mechanics equations like crash solvers (solution volumes interact only across surface boundaries, so some communication between processors is necessary, but not much)
Examples of things that don't go so well with parallelization:
dense matrix inversion (maximum dependence of each component from every other component)
implicit structural mechanics equations, like nonlinear equilibrium solution (requires matrix inversion to solve, so same problem as before)
You (probably) can't do that.
The reason that you cannot do that is probably that the program you run is Single-Threaded and you have a Quad-Core-Processor. 25% is a quarter of a whole (processor). This means that one core of your Four-Core-Processor is fully used - resulting in a 25% usage.
Unless you can make your software Multi-Threaded (that means using multiple cores parallely) you are stuck with this limit.

Parallelizing complex and data intensive calculations on the GPU

Preface: I'm sorry that this a very open-ended question, since it would be quite complex to go into the exact problem I am working on, and I think an abstract formulation also contains the necessary detail. If more details are needed though, feel free to ask.
Efficiency in GPU computing comes from being able to parallelize calculations over thousands of cores, even though these run more slowly than traditional CPU cores. I am wondering if this idea can be applied to the problem I am working on.
The problem I am working on is an optimisation problem, where a potential solution is generated, the quality of this solution calculated, and compared to the current best solution, in order to approach the best solution possible.
In the current algorithm, a variation of gradient descent, the calculating of this penalty is what takes by far the most processor time (Profiling suggest around 5% of the time is used to generate a new valid possibility, and 95% of the time is used to calculate the penalty). However, the calculating of this penalty is quite a complex process, where different parts of the (potential) solution depend on eachother, and are subject to multiple different constraints for which a penalty may be given to the solution - the data model for this problem currently takes over 200MB of RAM to store.
Are there strategies in which to write an algorithm for such a problem on the GPU? My problem is currently that the datamodel needs to be loaded for each processor core/thread working the problem, since the generating of a new solution takes so little time, it would be inefficient to start using locks and have to wait for a processor to be done with its penalty calculation.
A GPU obviously doesn't have this amount of memory available for each of its cores. However, my understanding is that if the model were to be stored on RAM, the overhead of communication between the GPU and the CPU would greatly slow down the algorithm (Currently around 1 million of these penalty calculations are performed every second on a single core of a fairly modern CPU, and I'm guessing a million transfers of data to the GPU every second would quickly become a bottleneck).
If anyone has any insights, or even a reference to a similar problem, I would be most grateful, since my own searches have not yet turned up much.

Julia parallel speedup performance for large scale computations

General context:
I have developed a fairly large Navier-Stokes (finite difference) solver written in FORTRAN90. It has adaptive grids (hence load-balance issue), and I have tried various techniques (MPI, OpenMP & OpenMP-MPI hyrbid) to parallelize it. However, it does not scale good enough i.e. according to Amdahl's law it runs 96-97% of the computations in parallel. Also, the general size of the mesh is a couple of hundred million points, which would require to increase later in the future.
Query:
Now, I am thinking of switching to Julia, since it has become very tedious to maintain and add further functionalities to the existing code.
The problem is that I am unable to find a good answer about the parallel performance of Julia. I have searched on the internet as well as have watched a lot of youtube videos. What I have noticed is that most people say that Julia is very much suitable for the parallel computing, some even provide a bar chart showing the reduction in the elapsed time compared to the serial code. However, some of the answers/videos are quite old, which make them a little unreliable due to the growing nature of this new language.
Therefore, I would like to know if the language has the ability to scale even for a few thousand cores?
Extra information:
I am still trying hard to improve the speedup of my existing code to achieve almost linear performance for a couple of thousand cores. The solver needs to exchange overlapping points 3-4 times per timestep. Hence, it involves a huge communication overhead. However, the non-adaptive grid version of the code easily scales up to 20k cores.
I have also read somewhere that Julia does not use InfiniBand standard for data communication in parallel.
The following paper has scaling results for pde constrained parameter estimation problems but not up to anywhere near the number of cores you seem to be interested in: https://arxiv.org/abs/1606.07399. I haven't seen any examples going up to thousands of cores.
Re infiniband: By default Julia uses shared memory for communication within a node and TCP/IP across nodes, so by default infiniband is not supported. However, the language allows for the implementation of custom transports and I imagine someone will add infiniband support at some point but I couldn't find any implementations with a quick google search.

MATLAB Parallel Computing Toolbox - Parallelization vs GPU?

I'm working with someone who has some MATLAB code that they want to be sped up. They are currently trying to convert all of this code into CUDA to get it to run on a CPU. I think it would be faster to use MATLAB's parallel computing toolbox to speed this up, and run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run this across several different worker nodes. Now, as part of the parallel computing toolbox, you can use things like GPUArray. However, I'm confused as to how this would work. Are using things like parfor (parallelization) and gpuarray (gpu programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think its still worth exploring the time it takes to convert all of your matlab code to cuda code to run on a machine with multiple GPUs...but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuarray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Underneath the hood, MATLAB is marshalling the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuarrays and the computation happens on the GPU.
The key differences between the two techniques are that parfor computations happen on the CPUs of nodes of the cluster with direct access to main memory. CPU cores typically have a high clock rate, but there are typically fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them in a cluster. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs and each PC has one or more Nvidia Tesla boards and you use both parfor loops and gpuarrays. However, I haven't had occasion to try this yet.
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with Parallization. The reason for this is, that GPU processing is only faster than cpu processing if you don't have to copy data back and forth. In case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU you will very often run into out of memory problems.
Parallization is great if you have big data structures and more than 2 cores in your computer CPU.
If you write it in CUDA it is guaranteed to run in parallel at the chip-level versus going with MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.

When does Erlang's parallelism overcome its weaknesses in numeric computing?

With all the hype around parallel computing lately, I've been thinking a lot about parallelism, number crunching, clusters, etc...
I started reading Learn You Some Erlang. As more people are learning (myself included), Erlang handles concurrency in a very impressive, elegant way.
Then the author asserts that Erlang is not ideal for number crunching. I can understand that a language like Erlang would be slower than C, but the model for concurrency seems ideally suited to things like image handling or matrix multiplication, even though the author specifically says its not.
Is it really that bad? Is there a tipping point where Erlang's strength overcomes its local speed weakness? Are/what measures are being taken to deal with speed?
To be clear: I'm not trying to start a debate; I just want to know.
It's a mistake to think of parallelism as only about raw number crunching power. Erlang is closer to the way a cluster computer works than, say, a GPU or classic supercomputer.
In modern GPUs and old-style supercomputers, performance is all about vectorized arithmetic, special-purpose calculation hardware, and low-latency communication between processing units. Because communication latency is low and each individual computing unit is very fast, the ideal usage pattern is to load the machine's RAM up with data and have it crunch it all at once. This processing might involve lots of data passing among the nodes, as happens in image processing or 3D, where there are lots of CPU-bound tasks to do to transform the data from input form to output form. This type of machine is a poor choice when you frequently have to go to a disk, network, or some other slow I/O channel for data. This idles at least one expensive, specialized processor, and probably also chokes the data processing pipeline so nothing else gets done, either.
If your program requires heavy use of slow I/O channels, a better type of machine is one with many cheap independent processors, like a cluster. You can run Erlang on a single machine, in which case you get something like a cluster within that machine, or you can easily run it on an actual hardware cluster, in which case you have a cluster of clusters. Here, communication overhead still idles processing units, but because you have many processing units running on each bit of computing hardware, Erlang can switch to one of the other processes instantaneously. If it happens that an entire machine is sitting there waiting on I/O, you still have the other nodes in the hardware cluster that can operate independently. This model only breaks down when the communication overhead is so high that every node is waiting on some other node, or for general I/O, in which case you either need faster I/O or more nodes, both of which Erlang naturally takes advantage of.
Communication and control systems are ideal applications of Erlang because each individual processing task takes little CPU and only occasionally needs to communicate with other processing nodes. Most of the time, each process is operating independently, each taking a tiny fraction of the CPU power. The most important thing here is the ability to handle many thousands of these efficiently.
The classic case where you absolutely need a classic supercomputer is weather prediction. Here, you divide the atmosphere up into cubes and do physics simulations to find out what happens in each cube, but you can't use a cluster because air moves between each cube, so each cube is constantly communicating with its 6 adjacent neighbors. (Air doesn't go through the edges or corners of a cube, being infinitely fine, so it doesn't talk to the other 20 neighboring cubes.) Run this on a cluster, whether running Erlang on it or some other system, and it instantly becomes I/O bound.
Is there a tipping point where Erlang's strength overcomes its local speed weakness?
Well, of course there is. For example, when trying to find the median of a trillion numbers :) :
http://matpalm.com/median/question.html
Just before you posted, I happened to notice this was the number 1 post on erlang.reddit.com.
Almost any language can be parallelized. In some languages it's simple, in others it's a pain in the butt, but it can be done. If you want to run a C++ program across 8000 CPU's in a grid, go ahead! You can do that. It's been done before.
Erlang doesn't do anything that's impossible in other languages. If a single CPU running an Erlang program is less efficient than the same CPU running a C++ program, then two hundred CPU's running Erlang will also be slower than two hundred CPU's running C++.
What Erlang does do is making this kind of parallelism easy to work with. It saves developer time and reduces the chance of bugs.
So I'm going to say no, there is no tipping point at which Erlang's parallelism allows it to outperform another language's numerical number-crunching strength.
Where Erlang scores is in making it easier to scale out and do so correctly. But it can still be done in other languages which are better at number-crunching, if you're willing to spend the extra development time.
And of course, let's not forget the good old point that languages don't have a speed.
A sufficiently good Erlang compiler would yield perfectly optimal code. A sufficiently bad C compiler would yield code that runs slower than anything else.
There is pressure to make Erlang execute numeric code faster. The HiPe compiler compiles to native code instead of the BEAM bytecode for example, and it probably has its most effective optimization on code on floating points where it can avoid boxing. This is very beneficial for floating point code, since it can store values directly in FPU registers.
For the majority of Erlang usage, Erlang is plenty fast as it is. They use Erlang to write always-up control systems where the most important speed measurement that matters is low latency responses. Performance under load tends to be IO-bound. These users tend to stay away from HiPe since it is not as flexible/malleable in debugging live systems.
Now that servers with 128Gb of RAM are not that uncommon, and there's no reason they'll get even more memory, some IO-bound problems might shift over to be somewhat CPU bound. That could be a driver.
You should follow HiPe for the development.
Your examples of image manipulations and matrix multiplications seem to me as very bad matches for Erlang though. Those are examples that benefit from vector/SIMD operations. Erlang is not good at parallellism (where one does the same thing to multiple values at once).
Erlang processes are MIMD, multiple instructions multiple data. Erlang does lots of branching behind pattern matching and recursive loops. That kills CPU instruction pipelining.
The best architecture for heavily parallellised problems are the GPUs. For programming GPUs in a functional language I see the best potential in using Haskell for creating programs targeting them. A GPU is basically a pure function from input data to output data. See the Lava project in Haskell for creating FPGA circuits, if it is possible to create circuits so cleanly in Haskell, it can't be harder to create program data for GPUs.
The Cell architecture is very nice for vectorizable problems as well.
I think the broader need is to point out that parallelism is not necessarily or even typically about speed.
It is about how to express algorithms or programs in which the sequence of activities is partial-ordered.

Resources