Is physics simulation really faster on GPU? - gpgpu

From what i have observed, havok does a significantly better job for rigid simulation than Physx, especially their new Havok Physics 2013.
Im not very familiar with how state of the art physics engine works, but by testing alone
i cannot get a very accurate test results.
For example, PhysX still seems to cripple CPU performance on purpose. My results shows when the simultaneous interaction rigids exceeds a certain amount (this ranges from 1024 to 8096 boxes), it's performance drops along a very unnatural steep curve and stops plumb dropping when it matches the performance of Bullet. Whereas, many other engines I've tested scale relatively linearly with scene complexity.
This goes worse if I want to measure the real world scenarios such as in games or in game-engines or even in CG making.
Undoubtedly, GPU handles particle physics far better than CPU, So i want to restrict this discussion on rigid body and soft body (including cloth) simulations.
So, is physics simulation really faster on GPU? If it is, by how much?

In general the architecure of a GPU is designed to handle massive data-parallel tasks by having a large number of stream processor cores with very wide SIMD instruction sets. So if the task can be decomposed into similarly structured independant kernels, then a GPU will be much faster (sometimes by orders of magntitude). CPUs do also have multiple cores and SIMD instructions as well, but not as many and not as wide. So it really depends on the specific properties and constraints of the specific workload, and whether or not it can take advantage of this extra parallel architecture.

Related

How to allocate more cpu usage to a program

I am running only one program on my computer to crunch numbers, and it takes up about 25% CPU (all other built-in applications are less than 4% CPU). Since this is the only program I am running, how do I raise the CPU percentage from 25% to 40%? I know changing the priority doesn't really help that much, or the affinity. I am using Windows 10. Thanks for help!
Distributing a demanding computational task (aka number crunching) among multiple processors or cores is usually not a trivial challenge. Likelihood of success depends upon how easy it is to divide the problem into sub-problems, each of which either doesn't need to communicate with each other or need a sufficiently small amount of communication such that communication overhead does not spoil all the speed gain you theoretically could get by using multiple processors.
Being as it is, this is usually a case-by-case decision. If you are lucky there is a special library for your problem domain for you ready to use.
Examples of problems that lend themselves to parallelization (quite) well are
video encoding (different time sections are practically independent of each other and can be encoded separately)
fractals like Mandelbrot (any area of the fractal is completely independent of the others)
explicit structural mechanics equations like crash solvers (solution volumes interact only across surface boundaries, so some communication between processors is necessary, but not much)
Examples of things that don't go so well with parallelization:
dense matrix inversion (maximum dependence of each component from every other component)
implicit structural mechanics equations, like nonlinear equilibrium solution (requires matrix inversion to solve, so same problem as before)
You (probably) can't do that.
The reason that you cannot do that is probably that the program you run is Single-Threaded and you have a Quad-Core-Processor. 25% is a quarter of a whole (processor). This means that one core of your Four-Core-Processor is fully used - resulting in a 25% usage.
Unless you can make your software Multi-Threaded (that means using multiple cores parallely) you are stuck with this limit.

hyperthreading and turbo boost in matrix multiply - worse performance using hyper threading

I am tunning my GEMM code and comparing with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can be also seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable. The performance jumps around quit a bit each iteration both with Eigen and my own GEMM code. I'm surprised that Hyperthreading makes the performance so much worse. I guess this is not not a question. It's an unexpected observation which I'm hoping to find feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPUz and look at the frequency as I'm running my GEMM code and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How to I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by treating two threads at once thus having more instruction level parallelism.
However, a well written matrix product kernel exhibits an excellent instruction level parallelism and thus exploits nearly 100% of the CPU ressources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
Unless I've missed something, always possible, your CPU has one clock shared by all its components so if you measure it's rate at 4.3GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (eg memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor-person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along pushing your n*10^6 contiguous memory locations through the FPUs a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles, at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more that the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work, is likely unaware that it is doing so, relying on merely the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide, is that you can max out all physical processors, without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads, they are not additional processors.

Is this simulation appropriate for CUDA or OpenCL?

I'm asking on behalf of a friend working in numerical astrophysics.
Basically what he's doing is simulating a cloud of gas. There are a finite number of cells and the timestep is defined such that gas cannot cross more than one cell each step. Each cell has properties like density and temperature. Each timestep, these (and position) need to be calculated. It's mainly position that's the issue I believe as that is affected primarily by the interactions of gravity among the cells, all of which affect each other.
At the moment he's running this on a cluster of ~150 nodes but I wondered, if it's parallelizable like this, could it be run faster on a few GPUs with CUDA? At the moment it takes him a couple of days to finish a simulation. As GPUs generally have ~500 cores, it seemed like they could provide a boost.
Maybe I'm totally wrong.
Yes this sounds like a decent application for a GPU. GPU processing is most effective when it's running the same function on a large data set. If you've already got it running in parallel on a cluster computer, I'd say write it and test it on a single graphics card, and see if that's an improvement on a single cluster, then scale accordingly.
The task you describe is a good fit for the GPU. GPUs have successfully been used for dramatically improving the performance in areas such as particle, aerodynamics and fluid simulations.
Without knowing more details about the simulation it's impossible to say for sure whether it would gain a performance boost. Broadly speaking, algorithms that are memory bound ( that is, relatively few arithmetic operations per memory transaction ) tend to benefit most from offloading to the GPU.
For astrophysics simulations specifically, the following link may be of use : http://www.astrogpu.org/

Algorithms FPGAs dominate CPUs on

For most of my life, I've programmed CPUs; and although for most algorithms, the big-Oh running time remains the same on CPUs / FPGAs, the constants are quite different (for example, lots of CPU power is wasted shuffling data around; whereas for FPGAs it's often compute bound).
I would like to learn more about this -- anyone know of good books / reference papers / tutorials that deals with the issue of:
what tasks do FPGAs dominate CPUs on (in terms of pure speed)
what tasks do FPGAs dominate CPUs on (in terms of work per jule)
Note: marked community wiki
[no links, just my musings]
FPGAs are essentially interpreters for hardware!
The architecture is like dedicated ASICs, but to get rapid development, and you pay a factor of ~10 in frequency and a [don't know, at least 10?] factor in power efficiency.
So take any task where dedicated HW can massively outperform CPUs, divide by the FPGA 10/[?] factors, and you'll probably still have a winner. Typical qualities of such tasks:
Massive opportunities for fine-grained parallelism.
(Doing 4 operations at once doesn't count; 128 does.)
Opportunity for deep pipelining.
This is also a kind of parallelism, but it's hard to apply it to a
single task, so it helps if you can get many separate tasks to
work on in parallel.
(Mostly) Fixed data flow paths.
Some muxes are OK, but massive random accesses are bad, cause you
can't parallelize them. But see below about memories.
High total bandwidth to many small memories.
FPGAs have hundreds of small (O(1KB)) internal memories
(BlockRAMs in Xilinx parlance), so if you can partition you
memory usage into many independent buffers, you can enjoy a data
bandwidth that CPUs never dreamed of.
Small external bandwidth (compared to internal work).
The ideal FPGA task has small inputs and outputs but requires a
lot of internal work. This way your FPGA won't starve waiting for
I/O. (CPUs already suffer from starving, and they alleviate it
with very sophisticated (and big) caches, unmatchable in FPGAs.)
It's perfectly possible to connect a huge I/O bandwidth to an
FPGA (~1000 pins nowdays, some with high-rate SERDESes) -
but doing that requires a custom board architected for such
bandwidth; in most scenarios, your external I/O will be a
bottleneck.
Simple enough for HW (aka good SW/HW partitioning).
Many tasks consist of 90% irregular glue logic and only 10%
hard work ("kernel" in the DSP sense). If you put all that
onto an FPGA, you'll waste precious area on logic that does no
work most of the time. Ideally, you want all the muck
to be handled in SW and fully utilize the HW for the kernel.
("Soft-core" CPUs inside FPGAs are a popular way to pack lots of
slow irregular logic onto medium area, if you can't offload it to
a real CPU.)
Weird bit manipulations are a plus.
Things that don't map well onto traditional CPU instruction sets,
such as unaligned access to packed bits, hash functions, coding &
compression... However, don't overestimate the factor this gives
you - most data formats and algorithms you'll meet have already
been designed to go easy on CPU instruction sets, and CPUs keep
adding specialized instructions for multimedia.
Lots of Floating point specifically is a minus because both
CPUs and GPUs crunch them on extremely optimized dedicated silicon.
(So-called "DSP" FPGAs also have lots of dedicated mul/add units,
but AFAIK these only do integers?)
Low latency / real-time requirements are a plus.
Hardware can really shine under such demands.
EDIT: Several of these conditions — esp. fixed data flows and many separate tasks to work on — also enable bit slicing on CPUs, which somewhat levels the field.
Well the newest generation of Xilinx parts just anounced brag 4.7TMACS and general purpose logic at 600MHz. (These are basically Virtex 6s fabbed on a smaller process.)
On a beast like this if you can implement your algorithms in fixed point operations, primarily multiply, adds and subtracts, and take advantage of both Wide parallelism and Pipelined parallelism you can eat most PCs alive, in terms of both power and processing.
You can do floating on these, but there will be a performance hit. The DSP blocks contain a 25x18 bit MACC with a 48bit sum. If you can get away with oddball formats and bypass some of the floating point normalization that normally occurs you can still eek out a truck load of performance out of these. (i.e. Use the 18Bit input as strait fixed point or float with a 17 bit mantissia, instead of the normal 24 bit.) Doubles floats are going to eat alot of resources so if you need that, you probably will do better on a PC.
If your algorithms can be expressed as in terms of add and subtract operations, then the general purpose logic in these can be used to implement gazillion adders. Things like Bresenham's line/circle/yadda/yadda/yadda algorithms are VERY good fits for FPGA designs.
IF you need division... EH... it's painful, and probably going to be relatively slow unless you can implement your divides as multiplies.
If you need lots of high percision trig functions, not so much... Again it CAN be done, but it's not going to be pretty or fast. (Just like it can be done on a 6502.) If you can cope with just using a lookup table over a limited range, then your golden!
Speaking of the 6502, a 6502 demo coder could make one of these things sing. Anybody who is familiar with all the old math tricks that programmers used to use on the old school machine like that will still apply. All the tricks that modern programmer tell you "let the libary do for you" are the types of things that you need to know to implement maths on these. If yo can find a book that talks about doing 3d on a 68000 based Atari or Amiga, they will discuss alot of how to implement stuff in integer only.
ACTUALLY any algorithms that can be implemented using look up tables will be VERY well suited for FPGAs. Not only do you have blockrams distributed through out the part, but the logic cells themself can be configured as various sized LUTS and mini rams.
You can view things like fixed bit manipulations as FREE! It's simply handle by routing. Fixed shifts, or bit reversals cost nothing. Dynamic bit operations like shift by a varable amount will cost a minimal amount of logic and can be done till the cows come home!
The biggest part has 3960 multipliers! And 142,200 slices which EACH one can be an 8 bit adder. (4 6Bit Luts per slice or 8 5bit Luts per slice depending on configuration.)
Pick a gnarly SW algorithm. Our company does HW acceleration of SW algo's for a living.
We've done HW implementations of regular expression engines that will do 1000's of rule-sets in parallel at speeds up to 10Gb/sec. The target market for that is routers where anti-virus and ips/ids can run real-time as the data is streaming by without it slowing down the router.
We've done HD video encoding in HW. It used to take several hours of processing time per second of film to convert it to HD. Now we can do it almost real-time...it takes almost 2 seconds of processing to convert 1 second of film. Netflix's used our HW almost exclusively for their video on demand product.
We've even done simple stuff like RSA, 3DES, and AES encryption and decryption in HW. We've done simple zip/unzip in HW. The target market for that is for security video cameras. The government has some massive amount of video cameras generating huge streams of real-time data. They zip it down in real-time before sending it over their network, and then unzip it in real-time on the other end.
Heck, another company I worked for used to do radar receivers using FPGA's. They would sample the digitized enemy radar data directly several different antennas, and from the time delta of arrival, figure out what direction and how far away the enemy transmitter is. Heck, we could even check the unintended modulation on pulse of the signals in the FPGA's to figure out the fingerprint of specific transmitters, so we could know that this signal is coming from a specific Russian SAM site that used to be stationed at a different border, so we could track weapons movements and sales.
Try doing that in software!! :-)
For pure speed:
- Paralizable ones
- DSP, e.g. video filters
- Moving data, e.g. DMA

When does Erlang's parallelism overcome its weaknesses in numeric computing?

With all the hype around parallel computing lately, I've been thinking a lot about parallelism, number crunching, clusters, etc...
I started reading Learn You Some Erlang. As more people are learning (myself included), Erlang handles concurrency in a very impressive, elegant way.
Then the author asserts that Erlang is not ideal for number crunching. I can understand that a language like Erlang would be slower than C, but the model for concurrency seems ideally suited to things like image handling or matrix multiplication, even though the author specifically says its not.
Is it really that bad? Is there a tipping point where Erlang's strength overcomes its local speed weakness? Are/what measures are being taken to deal with speed?
To be clear: I'm not trying to start a debate; I just want to know.
It's a mistake to think of parallelism as only about raw number crunching power. Erlang is closer to the way a cluster computer works than, say, a GPU or classic supercomputer.
In modern GPUs and old-style supercomputers, performance is all about vectorized arithmetic, special-purpose calculation hardware, and low-latency communication between processing units. Because communication latency is low and each individual computing unit is very fast, the ideal usage pattern is to load the machine's RAM up with data and have it crunch it all at once. This processing might involve lots of data passing among the nodes, as happens in image processing or 3D, where there are lots of CPU-bound tasks to do to transform the data from input form to output form. This type of machine is a poor choice when you frequently have to go to a disk, network, or some other slow I/O channel for data. This idles at least one expensive, specialized processor, and probably also chokes the data processing pipeline so nothing else gets done, either.
If your program requires heavy use of slow I/O channels, a better type of machine is one with many cheap independent processors, like a cluster. You can run Erlang on a single machine, in which case you get something like a cluster within that machine, or you can easily run it on an actual hardware cluster, in which case you have a cluster of clusters. Here, communication overhead still idles processing units, but because you have many processing units running on each bit of computing hardware, Erlang can switch to one of the other processes instantaneously. If it happens that an entire machine is sitting there waiting on I/O, you still have the other nodes in the hardware cluster that can operate independently. This model only breaks down when the communication overhead is so high that every node is waiting on some other node, or for general I/O, in which case you either need faster I/O or more nodes, both of which Erlang naturally takes advantage of.
Communication and control systems are ideal applications of Erlang because each individual processing task takes little CPU and only occasionally needs to communicate with other processing nodes. Most of the time, each process is operating independently, each taking a tiny fraction of the CPU power. The most important thing here is the ability to handle many thousands of these efficiently.
The classic case where you absolutely need a classic supercomputer is weather prediction. Here, you divide the atmosphere up into cubes and do physics simulations to find out what happens in each cube, but you can't use a cluster because air moves between each cube, so each cube is constantly communicating with its 6 adjacent neighbors. (Air doesn't go through the edges or corners of a cube, being infinitely fine, so it doesn't talk to the other 20 neighboring cubes.) Run this on a cluster, whether running Erlang on it or some other system, and it instantly becomes I/O bound.
Is there a tipping point where Erlang's strength overcomes its local speed weakness?
Well, of course there is. For example, when trying to find the median of a trillion numbers :) :
http://matpalm.com/median/question.html
Just before you posted, I happened to notice this was the number 1 post on erlang.reddit.com.
Almost any language can be parallelized. In some languages it's simple, in others it's a pain in the butt, but it can be done. If you want to run a C++ program across 8000 CPU's in a grid, go ahead! You can do that. It's been done before.
Erlang doesn't do anything that's impossible in other languages. If a single CPU running an Erlang program is less efficient than the same CPU running a C++ program, then two hundred CPU's running Erlang will also be slower than two hundred CPU's running C++.
What Erlang does do is making this kind of parallelism easy to work with. It saves developer time and reduces the chance of bugs.
So I'm going to say no, there is no tipping point at which Erlang's parallelism allows it to outperform another language's numerical number-crunching strength.
Where Erlang scores is in making it easier to scale out and do so correctly. But it can still be done in other languages which are better at number-crunching, if you're willing to spend the extra development time.
And of course, let's not forget the good old point that languages don't have a speed.
A sufficiently good Erlang compiler would yield perfectly optimal code. A sufficiently bad C compiler would yield code that runs slower than anything else.
There is pressure to make Erlang execute numeric code faster. The HiPe compiler compiles to native code instead of the BEAM bytecode for example, and it probably has its most effective optimization on code on floating points where it can avoid boxing. This is very beneficial for floating point code, since it can store values directly in FPU registers.
For the majority of Erlang usage, Erlang is plenty fast as it is. They use Erlang to write always-up control systems where the most important speed measurement that matters is low latency responses. Performance under load tends to be IO-bound. These users tend to stay away from HiPe since it is not as flexible/malleable in debugging live systems.
Now that servers with 128Gb of RAM are not that uncommon, and there's no reason they'll get even more memory, some IO-bound problems might shift over to be somewhat CPU bound. That could be a driver.
You should follow HiPe for the development.
Your examples of image manipulations and matrix multiplications seem to me as very bad matches for Erlang though. Those are examples that benefit from vector/SIMD operations. Erlang is not good at parallellism (where one does the same thing to multiple values at once).
Erlang processes are MIMD, multiple instructions multiple data. Erlang does lots of branching behind pattern matching and recursive loops. That kills CPU instruction pipelining.
The best architecture for heavily parallellised problems are the GPUs. For programming GPUs in a functional language I see the best potential in using Haskell for creating programs targeting them. A GPU is basically a pure function from input data to output data. See the Lava project in Haskell for creating FPGA circuits, if it is possible to create circuits so cleanly in Haskell, it can't be harder to create program data for GPUs.
The Cell architecture is very nice for vectorizable problems as well.
I think the broader need is to point out that parallelism is not necessarily or even typically about speed.
It is about how to express algorithms or programs in which the sequence of activities is partial-ordered.

Resources