what's make GPU programs non-preemptible? - parallel-processing

Modern operating systems have no support for GPUs, treating them more or less as a normal I/O device.there some researches in that areas attempt to managing GPUs at operating system level,but they claim that the GPU programs are non-preemptible: once a work unit has been started, it’s impossible to interrupt it without destroying the channel’s data.
so what i am asking is:
Is it true that it's non-preemtible?
If it's non-preemtible ,what make it non-preemtible ,is it because of hardware
design or what is the reason?
If it non-preemtible what we need to make it preemtible?
i'll be highly appreciated if someone can give a clear explanation.

GPU preempt themselves all the time, but only with other work items from the same kernel. If a compute unit is waiting on a memory read or write it will execute other work items. It's single instruction multiple threads essentially. However, it doesn't make sense to stop a job part way through and switch to a different job. You'd need to keep track of an enormous amount of state (unlike a serial processor that just has a register set, you'd have all that multiplied by the number of compute units). GPU jobs are all designed to run quickly, so cycling jobs through the system is more efficient that switching between partially complete jobs. That all said, some modern GPUs divide up the hardware and can have different parts working on different jobs at the same time.

At the risk of over simplification:
Let's say I have a solid object defined within the GPU. For simplicity, let's say that the object is a cube and that the GPU maintains 8 vertices (and that the GPU is VERY slow).
Let me start a rotation. I have do a matrix multiplication on each vertex. I do 3 of them. Then I get preempted.
My cube is no longer a cube.
If you wanted it preemtable, you'd need some kind of transaction processing with rollback (slowing things down) and hardware support giving a preemptable interface.

Related

CPU SIMD vs GPU SIMD?

GPU uses the SIMD paradigm, that is, the same portion of code will be executed in parallel, and applied to various elements of a data set.
However, CPU also uses SIMD, and provide instruction-level parallelism. For example, as far as I know, SSE-like instructions will process data elements with parallelism.
While the SIMD paradigm seems to be used differently in GPU and CPU, does GPUs have more SIMD power than CPUs?
In which way the parallel computational capabilities in a CPU are 'weaker' than the ones in a GPU?
Both CPUs & GPUs provide SIMD with the most standard conceptual unit being 16 bytes/128 bits; for example a Vector of 4 floats (x,y,z,w).
Simplifying:
CPUs then parallelize more through pipelining future instructions so they proceed faster through a program. Then next step is multiple cores which run independent programs.
GPUs on the other hand parallelize by continuing the SIMD approach and executing the same program multiple times; both by pure SIMD where a set of programs execute in lock step (which is why branching is bad on a GPU, as both sides of an if statement must execute; and one result be thrown away so that the lock step programs proceed at the same rate); and also by single program, multiple data (SPMD) where groups of the sets of identical programs proceed in parallel but not necessarily in lock step.
The GPU approach is great where the exact same processing needs be applied to large volumes of data; for example a million vertices than need to be transformed in the same way, or many million pixels that need the processing to produce their colour. Assuming they don't become data block/pipeline stalled, GPUs programs general offer more predictable time bound execution due to its restrictions; which again is good for temporal parallelism e.g. the programs need to repeat their cycle at a certain rate for example 60 times a second (16ms) for 60 fps.
The CPU approach however is better for decisioning and performing multiple different tasks at the same time and dealing with changing inputs and requests.
Apart from its many other uses and purposes, the CPU is used to orchestrate work for the GPU to perform.
It's a similar idea, it goes kind of like this (very informally speaking):
The CPU has a set amount of functions that can run on packed values. Depending on your brand and version of your CPU, you might have access to SSE2, 3, 4, 3dnow, etc, and each of them gives you access to more and more functions. You're limited by the register size and the larger data types you work with the less values you can use in parallel. You can freely mix and match SIMD instructions with traditional x86/x64 instructions.
The GPU lets you write your entire pipeline for each pixel of a texture. The texture size doesn't depend on your pipeline length, ie the number of values you can affect in one cycle isn't dependant on anything but your GPU, and the functions you can chain (your pixel shader) can be pretty much anything. It's somewhat more rigid though in that the setup and readback of your values is somewhat slower, and it's a one shot process (load values, run shader, read values), you can't massage them at all besides that, so you actually need to use a lot of values for it to be worth it.

OpenCL: work group concept

I don't really understand the purpose of Work-Groups in OpenCL.
I understand that they are a group of Work Items (supposedly, hardware threads), which ones get executed in parallel.
However, why is there this need of coarser subdivision ? Wouldn't it be OK to have only the grid of threads (and, de facto, only one W-G)?
Should a Work-Group exactly map to a physical core ? For example, the TESLA c1060 card is said to have 240 cores. How would the Work-Groups map to this??
Also, as far as I understand, work-items inside a work group can be synchronized thanks to memory fences. Can work-groups synchronize or is that even needed ? Do they talk to each other via shared memory or is this only for work items (not sure on this one)?
Part of the confusion here I think comes down to terminology. What GPU people often call cores, aren't really, and what GPU people often call threads are only in a certain sense.
Cores
A core, in GPU marketing terms may refer to something like a CPU core, or it may refer to a single lane of a SIMD unit - in effect a single core x86 CPU would be four cores of this simpler type. This is why GPU core counts can be so high. It isn't really a fair comparison, you have to divide by 16, 32 or a similar number to get a more directly comparable core count.
Work-items
Each work-item in OpenCL is a thread in terms of its control flow, and its memory model. The hardware may run multiple work-items on a single thread, and you can easily picture this by imagining four OpenCL work-items operating on the separate lanes of an SSE vector. It would simply be compiler trickery that achieves that, and on GPUs it tends to be a mixture of compiler trickery and hardware assistance. OpenCL 2.0 actually exposes this underlying hardware thread concept through sub-groups, so there is another level of hierarchy to deal with.
Work-groups
Each work-group contains a set of work-items that must be able to make progress in the presence of barriers. In practice this means that it is a set, all of whose state is able to exist at the same time, such that when a synchronization primitive is encountered there is little overhead in switching between them and there is a guarantee that the switch is possible.
A work-group must map to a single compute unit, which realistically means an entire work-group fits on a single entity that CPU people would call a core - CUDA would call it a multiprocessor (depending on the generation), AMD a compute unit and others have different names. This locality of execution leads to more efficient synchronization, but it also means that the set of work-items can have access to locally constructed memory units. They are expected to communicate frequently, or barriers wouldn't be used, and to make this communication efficient there may be local caches (similar to a CPU L1) or scratchpad memories (local memory in OpenCL).
As long as barriers are used, work-groups can synchronize internally, between work-items, using local memory, or by using global memory. Work-groups cannot synchronize with each other and the standard makes no guarantees on forward progress of work-groups relative to each other, which makes building portable locking and synchronization primitives effectively impossible.
A lot of this is due to history rather than design. GPU hardware has long been designed to construct vector threads and assign them to execution units in a fashion that optimally processes triangles. OpenCL falls out of generalising that hardware to be useful for other things, but not generalising it so much that it becomes inefficient to implement.
There are already alot of good answers, for further understanding of the terminology of OpenCL this paper ("An Introduction to the OpenCL Programming Model" by Jonathan Tompson and Kristofer Schlachter) actually describes all the concepts very well.
Use of the work-groups allows more optimization for the kernel compilers. This is because data is not transferred between work-groups. Depending on used OpenCL device, there might be caches that can be used for local variables to result faster data accesses. If there is only one work-group, local variables would be just the same as global variables which would lead to slower data accesses.
Also, usually OpenCL devices use Single Instruction Multiple Data (SIMD) extensions to achieve good parallelism. One work group can be run in parallel with SIMD extensions.
Should a Work-Group exactly map to a physical core ?
I think that, only way to find the fastest work-group size, is to try different work-group sizes. It is also possible to query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE from the device with clGetKernelWorkGroupInfo. The fastest size should be multiple of that.
Can work-groups synchronize or is that even needed ?
Work-groups cannot be synchronized. This way there is no data dependencies between them and they can also be run sequentially, if that is considered to be the fastest way to run them. To achieve same result, than synchronization between work-groups, kernel needs to split into multiple kernels. Variables can be transferred between the kernels with buffers.
One benefit of work groups is they enable using shared local memory as a programmer-defined cache. A value read from global memory can be stored in shared work-group local memory and then accessed quickly by any work item in the work group. A good example is the game of life: each cell depends on itself and the 8 around it. If each work item read this information you'd have 9x global memory reads. By using work groups and shared local memory you can approach 1x global memory reads (only approach since there is redundant reads at the edges).

Is this simulation appropriate for CUDA or OpenCL?

I'm asking on behalf of a friend working in numerical astrophysics.
Basically what he's doing is simulating a cloud of gas. There are a finite number of cells and the timestep is defined such that gas cannot cross more than one cell each step. Each cell has properties like density and temperature. Each timestep, these (and position) need to be calculated. It's mainly position that's the issue I believe as that is affected primarily by the interactions of gravity among the cells, all of which affect each other.
At the moment he's running this on a cluster of ~150 nodes but I wondered, if it's parallelizable like this, could it be run faster on a few GPUs with CUDA? At the moment it takes him a couple of days to finish a simulation. As GPUs generally have ~500 cores, it seemed like they could provide a boost.
Maybe I'm totally wrong.
Yes this sounds like a decent application for a GPU. GPU processing is most effective when it's running the same function on a large data set. If you've already got it running in parallel on a cluster computer, I'd say write it and test it on a single graphics card, and see if that's an improvement on a single cluster, then scale accordingly.
The task you describe is a good fit for the GPU. GPUs have successfully been used for dramatically improving the performance in areas such as particle, aerodynamics and fluid simulations.
Without knowing more details about the simulation it's impossible to say for sure whether it would gain a performance boost. Broadly speaking, algorithms that are memory bound ( that is, relatively few arithmetic operations per memory transaction ) tend to benefit most from offloading to the GPU.
For astrophysics simulations specifically, the following link may be of use : http://www.astrogpu.org/

MATLAB Parallel Computing Toolbox - Parallelization vs GPU?

I'm working with someone who has some MATLAB code that they want to be sped up. They are currently trying to convert all of this code into CUDA to get it to run on a CPU. I think it would be faster to use MATLAB's parallel computing toolbox to speed this up, and run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run this across several different worker nodes. Now, as part of the parallel computing toolbox, you can use things like GPUArray. However, I'm confused as to how this would work. Are using things like parfor (parallelization) and gpuarray (gpu programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think its still worth exploring the time it takes to convert all of your matlab code to cuda code to run on a machine with multiple GPUs...but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuarray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Underneath the hood, MATLAB is marshalling the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuarrays and the computation happens on the GPU.
The key differences between the two techniques are that parfor computations happen on the CPUs of nodes of the cluster with direct access to main memory. CPU cores typically have a high clock rate, but there are typically fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them in a cluster. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs and each PC has one or more Nvidia Tesla boards and you use both parfor loops and gpuarrays. However, I haven't had occasion to try this yet.
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with Parallization. The reason for this is, that GPU processing is only faster than cpu processing if you don't have to copy data back and forth. In case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU you will very often run into out of memory problems.
Parallization is great if you have big data structures and more than 2 cores in your computer CPU.
If you write it in CUDA it is guaranteed to run in parallel at the chip-level versus going with MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.

When does Erlang's parallelism overcome its weaknesses in numeric computing?

With all the hype around parallel computing lately, I've been thinking a lot about parallelism, number crunching, clusters, etc...
I started reading Learn You Some Erlang. As more people are learning (myself included), Erlang handles concurrency in a very impressive, elegant way.
Then the author asserts that Erlang is not ideal for number crunching. I can understand that a language like Erlang would be slower than C, but the model for concurrency seems ideally suited to things like image handling or matrix multiplication, even though the author specifically says its not.
Is it really that bad? Is there a tipping point where Erlang's strength overcomes its local speed weakness? Are/what measures are being taken to deal with speed?
To be clear: I'm not trying to start a debate; I just want to know.
It's a mistake to think of parallelism as only about raw number crunching power. Erlang is closer to the way a cluster computer works than, say, a GPU or classic supercomputer.
In modern GPUs and old-style supercomputers, performance is all about vectorized arithmetic, special-purpose calculation hardware, and low-latency communication between processing units. Because communication latency is low and each individual computing unit is very fast, the ideal usage pattern is to load the machine's RAM up with data and have it crunch it all at once. This processing might involve lots of data passing among the nodes, as happens in image processing or 3D, where there are lots of CPU-bound tasks to do to transform the data from input form to output form. This type of machine is a poor choice when you frequently have to go to a disk, network, or some other slow I/O channel for data. This idles at least one expensive, specialized processor, and probably also chokes the data processing pipeline so nothing else gets done, either.
If your program requires heavy use of slow I/O channels, a better type of machine is one with many cheap independent processors, like a cluster. You can run Erlang on a single machine, in which case you get something like a cluster within that machine, or you can easily run it on an actual hardware cluster, in which case you have a cluster of clusters. Here, communication overhead still idles processing units, but because you have many processing units running on each bit of computing hardware, Erlang can switch to one of the other processes instantaneously. If it happens that an entire machine is sitting there waiting on I/O, you still have the other nodes in the hardware cluster that can operate independently. This model only breaks down when the communication overhead is so high that every node is waiting on some other node, or for general I/O, in which case you either need faster I/O or more nodes, both of which Erlang naturally takes advantage of.
Communication and control systems are ideal applications of Erlang because each individual processing task takes little CPU and only occasionally needs to communicate with other processing nodes. Most of the time, each process is operating independently, each taking a tiny fraction of the CPU power. The most important thing here is the ability to handle many thousands of these efficiently.
The classic case where you absolutely need a classic supercomputer is weather prediction. Here, you divide the atmosphere up into cubes and do physics simulations to find out what happens in each cube, but you can't use a cluster because air moves between each cube, so each cube is constantly communicating with its 6 adjacent neighbors. (Air doesn't go through the edges or corners of a cube, being infinitely fine, so it doesn't talk to the other 20 neighboring cubes.) Run this on a cluster, whether running Erlang on it or some other system, and it instantly becomes I/O bound.
Is there a tipping point where Erlang's strength overcomes its local speed weakness?
Well, of course there is. For example, when trying to find the median of a trillion numbers :) :
http://matpalm.com/median/question.html
Just before you posted, I happened to notice this was the number 1 post on erlang.reddit.com.
Almost any language can be parallelized. In some languages it's simple, in others it's a pain in the butt, but it can be done. If you want to run a C++ program across 8000 CPU's in a grid, go ahead! You can do that. It's been done before.
Erlang doesn't do anything that's impossible in other languages. If a single CPU running an Erlang program is less efficient than the same CPU running a C++ program, then two hundred CPU's running Erlang will also be slower than two hundred CPU's running C++.
What Erlang does do is making this kind of parallelism easy to work with. It saves developer time and reduces the chance of bugs.
So I'm going to say no, there is no tipping point at which Erlang's parallelism allows it to outperform another language's numerical number-crunching strength.
Where Erlang scores is in making it easier to scale out and do so correctly. But it can still be done in other languages which are better at number-crunching, if you're willing to spend the extra development time.
And of course, let's not forget the good old point that languages don't have a speed.
A sufficiently good Erlang compiler would yield perfectly optimal code. A sufficiently bad C compiler would yield code that runs slower than anything else.
There is pressure to make Erlang execute numeric code faster. The HiPe compiler compiles to native code instead of the BEAM bytecode for example, and it probably has its most effective optimization on code on floating points where it can avoid boxing. This is very beneficial for floating point code, since it can store values directly in FPU registers.
For the majority of Erlang usage, Erlang is plenty fast as it is. They use Erlang to write always-up control systems where the most important speed measurement that matters is low latency responses. Performance under load tends to be IO-bound. These users tend to stay away from HiPe since it is not as flexible/malleable in debugging live systems.
Now that servers with 128Gb of RAM are not that uncommon, and there's no reason they'll get even more memory, some IO-bound problems might shift over to be somewhat CPU bound. That could be a driver.
You should follow HiPe for the development.
Your examples of image manipulations and matrix multiplications seem to me as very bad matches for Erlang though. Those are examples that benefit from vector/SIMD operations. Erlang is not good at parallellism (where one does the same thing to multiple values at once).
Erlang processes are MIMD, multiple instructions multiple data. Erlang does lots of branching behind pattern matching and recursive loops. That kills CPU instruction pipelining.
The best architecture for heavily parallellised problems are the GPUs. For programming GPUs in a functional language I see the best potential in using Haskell for creating programs targeting them. A GPU is basically a pure function from input data to output data. See the Lava project in Haskell for creating FPGA circuits, if it is possible to create circuits so cleanly in Haskell, it can't be harder to create program data for GPUs.
The Cell architecture is very nice for vectorizable problems as well.
I think the broader need is to point out that parallelism is not necessarily or even typically about speed.
It is about how to express algorithms or programs in which the sequence of activities is partial-ordered.

Resources