How are GPUs faster than CPUs? I've read articles that talk about how GPUs are much faster at breaking passwords than CPUs. If that's the case, why can't CPUs be designed the same way as GPUs so they are just as fast?
GPUs get their speed at a cost. A single GPU core actually runs much slower than a single CPU core. For example, the Fermi GTX 580 has a core clock of 772 MHz. You wouldn't want your CPU to run at such a low clock nowadays...
The GPU, however, has several cores (up to 16), each operating in a 32-wide SIMD mode. That gives roughly 500 operations done in parallel. Common CPUs, by contrast, have up to 4 or 8 cores and can operate in 4-wide SIMD, which gives much lower parallelism.
Certain types of algorithms (graphics processing, linear algebra, video encoding, etc.) can be easily parallelized across such a huge number of cores. Breaking passwords falls into that category.
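To make the "easily parallelized" point concrete, here is a minimal CUDA-style sketch (my illustration, not any real cracker): each thread independently tests one candidate against a target digest. The toy_hash function and the 4-lowercase-letter candidate space are invented purely for illustration.

    // Deliberately trivial stand-in for a real hash such as MD5 (FNV-1a here).
    __device__ unsigned int toy_hash(const char *s, int len) {
        unsigned int h = 2166136261u;
        for (int i = 0; i < len; ++i)
            h = (h ^ (unsigned char)s[i]) * 16777619u;
        return h;
    }

    // One thread per candidate password: the work items are fully independent.
    __global__ void crack(unsigned int target, int *found_index) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        char candidate[4];
        int n = idx;
        for (int i = 0; i < 4; ++i) { candidate[i] = 'a' + n % 26; n /= 26; }
        if (toy_hash(candidate, 4) == target)
            *found_index = idx;        // benign race: any matching hit is fine
    }

    // Host side: test all 26^4 = 456976 four-letter candidates in one launch.
    // crack<<<(456976 + 255) / 256, 256>>>(target, d_found);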
Other algorithms however are really hard to parallelize. There is ongoing research in this area... Those algorithms would perform really badly if they were run on the GPU.
CPU companies are now trying to approach GPU-style parallelism without sacrificing the ability to run single-threaded programs. But the task is not an easy one. The Larrabee project (currently abandoned) is a good example of the problems. Intel has been working on it for years, but it is still not available on the market.
GPUs are designed with one goal in mind: process graphics really fast. Since this is their only concern, there are specialized optimizations in place that allow certain calculations to go a LOT faster than they would on a traditional processor.
In the case of password cracking (or the molecular-dynamics Folding@home project), what has happened is that programmers have found ways of leveraging these optimized units to do things like crunch passwords at a much faster rate.
Your standard CPU has to handle many more kinds of calculations and processing than a graphics processor does, so it can't be optimized in the same manner.
I was watching a video about the .kkrieger FPS game (https://www.youtube.com/watch?v=bD1wWY1YD-M) and was astounded by the incredible work it took to fit such a complex game into such an insanely small size (96 kB). However, it consumes a huge amount of CPU and GPU processing.
That raised the following question: is it possible to develop a graphics engine/framework/tool for high performance and high fps without relying so much on high-end CPU + GPU processing power? I am not asking about reducing storage size here, but about needing less CPU + GPU processing power to improve the fps.
As Nicol Bolas pointed out, there are many ways to read the question and mine was too broad or unfocused, so let me restate it: can an engine or hand-written code achieve high resolution and high fps without requiring a high-spec CPU + GPU combo?
Computers are not magic. Everything they do has to come from somewhere and be the result of some process.
It is impressive to be able to generate interesting assets from algorithms. But this is a memory vs. performance tradeoff: you are exchanging small storage size for the processing power needed to generate those assets. Essentially, algorithmic generation can be thought of as a form of data compression. And generally speaking, the bigger your compression ratio, the longer it will take to decompress the data.
If you want more stuff, it's going to cost you something. They chose to optimize for disk storage space, and that has costs in runtime memory and performance.
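As a minimal sketch of the "procedural generation as compression" idea (my own illustration, not .kkrieger's actual code): a few bytes of parameters plus some CPU time at load replace a multi-megabyte texture on disk.

    #include <cmath>
    #include <vector>

    // Instead of storing a 1024x1024 RGBA texture (~4 MB) on disk, regenerate
    // it from a few bytes of parameters at load time. The pattern here is
    // arbitrary; .kkrieger's generators are far more sophisticated.
    std::vector<unsigned char> make_texture(int w, int h, float freq) {
        std::vector<unsigned char> px(w * h * 4);
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                float v = 0.5f + 0.5f * std::sin(freq * x) * std::cos(freq * y);
                int i = (y * w + x) * 4;
                px[i] = px[i + 1] = px[i + 2] = (unsigned char)(v * 255.0f);
                px[i + 3] = 255;
            }
        return px;   // the CPU time spent here is the price of the tiny binary
    }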
I'm very interested in learning FPGA development. I've found a bunch of "getting started with FPGA" questions here, and other tutorials and resources on the internet. But I'm primarily interested in using FPGAs as accelerators, and I can't figure out which devices will actually offer a speed-up over a desktop CPU (say, a recent i7).
My particular interest at the moment is cellular automata (and other parallel environments like neural networks and agent-based modeling). I'd like to experiment with 3D or higher-dimensional cellular automata. My question is: will the low-cost $100-$200 starter kits provide something that has the potential to produce a significant speed-up over a desktop CPU? Or would I need to spend more and get a higher-end FPGA?
The FPGA can be a very good accelerator, but (and this is a big BUT) it is usually very expensive. We have machines here like the BEEcube, a Convey, or a "Godzilla's Part Time Nanny" from Dini, and they are all very expensive (>$10k); even with these machines, many applications can be accelerated better with a standard CPU cluster or GPUs. The FPGA does a bit better when total cost of ownership is considered, since it usually has better energy efficiency.
But there are applications that you can accelerate. At the lower end, you can and should do a rough estimate of whether it is worth it for your application, but for that you need more concrete numbers about your application. Consider a standard desktop CPU: it usually has at least 4 cores (or is dual-core with hyper-threading, not to mention the vector units) and clocks at, say, 3 GHz. That gives 12 Gcycles per second of computation power. A cheap FPGA gets you to 250 MHz (better ones can reach up to 500 MHz, but that requires very friendly designs and very good speed grades), so you need approximately 50 operations in parallel to compete with the CPU (actually it's a bit better, because the CPU usually doesn't have single-cycle ops, but it also has vector operations, so call it even).
50 operations in parallel sounds like a lot, and it is hard, but it is doable (the magic word here is pipelining). So you should know exactly how you are going to implement your design in hardware and what degree of parallelism you can use.
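For what it's worth, here is the back-of-the-envelope comparison above written out as a tiny program; the numbers are the rough estimates from this answer, not measurements of any particular device.

    #include <stdio.h>

    int main(void) {
        // Back-of-the-envelope figures from the text above, not measurements.
        double cpu_cores     = 4;         // desktop CPU cores
        double cpu_clock_hz  = 3.0e9;     // 3 GHz
        double fpga_clock_hz = 250.0e6;   // cheap FPGA fabric clock

        double cpu_cycles_per_s = cpu_cores * cpu_clock_hz;          // 12e9
        double ops_per_cycle    = cpu_cycles_per_s / fpga_clock_hz;  // ~48

        printf("the FPGA design must retire ~%.0f operations per cycle "
               "to match the CPU\n", ops_per_cycle);
        return 0;
    }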
Even if you solve the parallelism problem, we come to the real problem: memory.
The accelerators mentioned above have so much compute capacity that they could do thousands of things in parallel, but the real problem with such computational power is how to get the data into and out of them. And you have this problem at your small scale, too. In your desktop PC the CPU transfers more than 20 GB/s to/from memory (a good GPU card manages 100 GB/s and more), while your small $100-$200 accelerator gets at most (if you're lucky) 1-2 GB/s over PCI Express.
Whether it is worth it for you depends completely on your application (and here you need far more detail than "3D cellular automata": you must know the neighbourhoods, the required precision (double, single float, integer, or fixed point?), and your use case (do you transfer the initial cell values, let the machine compute for two days, and then transfer the cell values back, or do you need the cell values after every step? The latter makes a huge difference in the bandwidth required during computation)).
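To illustrate why that last point matters so much, here is a small, purely hypothetical estimate (my own numbers: a 256^3 grid, 4 bytes per cell, 10000 steps, and the 1-2 GB/s PCIe figure from above):

    #include <stdio.h>

    int main(void) {
        // Hypothetical 3D automaton: 256^3 cells, 4 bytes per cell.
        double cells          = 256.0 * 256.0 * 256.0;
        double bytes_per_cell = 4.0;
        double grid_bytes     = cells * bytes_per_cell;     // ~67 MB
        double pcie_bw        = 1.5e9;                      // from the 1-2 GB/s above
        double steps          = 10000.0;

        // Transfer once at the start and once at the end:
        printf("one-shot transfer: %.2f s\n", 2.0 * grid_bytes / pcie_bw);
        // Transfer the full grid back after every step:
        printf("per-step transfer: %.0f s\n", steps * grid_bytes / pcie_bw);
        return 0;
    }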
But overall, without knowing more, I would say: it is worth the $100-$200.
Not because you can compute your cellular automata faster (which I don't believe), but because you will learn. And you will not only learn hardware design and FPGA development; I see with the students we have here that, along with hardware-design knowledge, they always gain a far better understanding of how hardware actually looks and behaves. Sure, nothing you do on your FPGA is directly related to the interior of a CPU, but many get a better feeling for what hardware in general is capable of, which in turn makes them even more effective software developers.
But I also have to admit: you are going to pay a much higher price than just the $100-$200. You will have to spend a lot of time on it.
Disclaimer: I work for a reconfigurable system developer/manufacturer.
A short answer to your question "will the low-cost $100-$200 starter kits provide something that has the potential to produce a significant speed-up over a desktop CPU" is: probably not.
A longer answer:
A microprocessor is a set of fixed, shared functional units tuned to perform reasonably well across a broad range of applications. The operating system and compilers do a good job of making sure that these fixed, shared functional units are utilized appropriately.
FPGA based systems get their performance from dedicated, dense, computational efficiency. You create exactly what you need to execute your application, no more, no less - and whatever you create is not shared with any other user, process, operating system, whatever. If you need 80 floating point units, you create 80 dedicated floating point units that run in parallel. Compare that to a microprocessor scheduling floating point operations across some smaller number of floating point units. To get performance faster than a microprocessor, you have to instantiate enough dedicated FPGA-based functional units to make a performance difference vs. a microprocessor. This often requires the resources in the larger FPGA devices.
An FPGA alone is not enough. If you create a large number of efficient computational engines in an FPGA you have to keep those engines fed with data. This requires a number of high-bandwidth connections to large amounts of data memory around the FPGA. What you often see with I/O-based FPGA cards is that some of the potential performance gain is diminished by moving data back and forth across the I/O bus.
As a data point, my company uses the '530 Stratix IV FPGA from Altera. We surround it with several directly coupled memories and tie this subsystem directly into the microprocessor memory. We get several advantages over microprocessor systems for many applications, but this is not a $100-$200 starter kit, this is a full-blown integrated system.
I am looking for rules of thumb for designing algorithms where the data is accessed slowly due to limitations of disk speed, PCI speed (GPGPU), or another bottleneck.
Also, how does one manage GPGPU programs where the memory needed by the application exceeds GPGPU memory?
In general, the GPU memory should not be an arbitrary limitation on the size of data for algorithms. The GPU memory could be considered to be a "cache" of data that the GPU is currently operating on, but many GPU algorithms are designed to operate on more data than can fit in the "cache". This is accomplished by moving data to and from the GPU while computation is going on, and the GPU has specific concurrent execution and copy/compute overlap mechanisms to enable this.
This usually implies that independent work can be completed on sections of the data, which is typically a good indicator for acceleration in a parallelizable application. Conceptually, this is similar to large scale MPI applications (such as high performance linpack) which break the work into pieces and then send the pieces to various machines (MPI ranks) for computation.
If the amount of work to be done on the data is small compared to the cost to transfer the data, then the data transfer speed will still become the bottleneck, unless it is addressed directly via changes to the storage system.
The basic approach to handling out-of-core or algorithms where the data set is too large to fit in GPU memory all at once is to determine a version of the algorithm which can work on separable data, and then craft a "pipelined" algorithm to work on the data in chunks. An example tutorial which covers such a programming technique is here (focus starts around 40 minute mark, but the whole video is relevant).
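A minimal sketch of that pipelined, chunked approach in CUDA, assuming a placeholder process_chunk kernel and omitting error checking. For the copies to genuinely overlap with compute, h_data would need to be allocated as pinned memory (cudaHostAlloc):

    // Placeholder kernel: whatever per-chunk work your algorithm needs.
    __global__ void process_chunk(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    // Stream the data through the GPU in chunks, using two streams so the
    // copy of one chunk overlaps with the compute of the previous one.
    void run_out_of_core(float *h_data, int total, int chunk) {
        cudaStream_t streams[2];
        float *d_buf[2];
        for (int s = 0; s < 2; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&d_buf[s], chunk * sizeof(float));
        }
        for (int off = 0, s = 0; off < total; off += chunk, s ^= 1) {
            int n = (total - off < chunk) ? total - off : chunk;
            cudaMemcpyAsync(d_buf[s], h_data + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, streams[s]);
            process_chunk<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n);
            cudaMemcpyAsync(h_data + off, d_buf[s], n * sizeof(float),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < 2; ++s) {
            cudaFree(d_buf[s]);
            cudaStreamDestroy(streams[s]);
        }
    }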
Simple question: is increasing the number of cores directly related to performance?
My understanding (kindly correct me if I am wrong) is that in multi-core systems, communication overhead and memory latencies are a limiting factor on performance compared to a single core. Perhaps a single-core system with large L1 and L2 caches could perform much better than a Core 2 Duo? But then why is the number of cores increased in almost every new architecture? There must be a reason, which is what I'm here to find out.
Thanks for the help!
Generally, neither memory latency nor bandwidth is an issue when scaling up the number of cores in a system. Note: there are probably specialized exceptions, but by and large most modern systems don't start running into memory bottlenecks until 6+ hardware cores are accessing memory.
Communication overhead, however, can be devastatingly expensive. The technical reasons for this are extremely complicated and beyond the scope of my answer -- some aspects are related to hardware, but others are simply related to the cost of one core blocking while waiting for another to finish its calculations; both are bad. Because of this, programs/applications that utilize multiple cores typically must do so with as little communication between cores as possible. This limits the types of tasks that can be off-loaded onto separate cores.
New systems are adding more cores simply because it is technologically feasible -- increasing single-core performance is neither technically nor economically viable anymore. Almost all application programmers I know would absolutely prefer a single ultra-fast core over having to figure out how to efficiently utilize 12 cores. But the chip manufacturers couldn't produce such a core even if you granted them tens of millions of dollars.
As long as the speed of light is a fixed constant, parallel processing will be here to stay. As it is today, much of the speed improvement found in CPUs is due to parallel processing of individual instructions. As much as possible, a Core 2 Duo (for example) will run up to four instructions in parallel. This works because in many programs, sequences of instructions are often not immediately dependent on each other:
a = g_Var1 + 1;   // independent
b = g_Var2 + 3;   // independent
c = b * a;        // depends on the results of lines 1 and 2
d = g_Var3 + 5;   // independent
Modern CPUs will actually execute lines 1, 2, and 4 in parallel, and then double back and finish up line 3 -- usually in parallel with whatever comes on lines 5, 6, etc. (assuming the 'c' result isn't needed by any of them). This is necessary because our ability to speed up or shorten the pipeline for executing any single instruction is very limited. So instead, engineers have been focusing on "going wide" -- more instructions in parallel, more cores in parallel, more computers in parallel (the last being similar to cloud computing, BOINC, or the @home projects).
It depends on your software. If you have CPU-intensive calculation tasks that don't need much external communication and can run in parallel, multi-core is the way to scale vertically. It will perform much better than a single-core CPU, since it can run the calculation tasks in parallel (again, this depends on whether your particular task(s) can take advantage of parallel execution). For example, DB servers usually take advantage of parallel processing and scale well on multi-core CPUs.
Once the vertical limit is exhausted, you can scale horizontally by introducing multiple nodes in your cluster, and then you need to coordinate task execution.
So, to your question: "But then why is the number of cores increased in almost every new architecture?"
One of the reasons is that software evolves to take advantage of parallel processing, and hardware is trying to satisfy this hunger.
You're assuming that cores can become usefully more complex. At this point, that's not a safe assumption.
You can either execute more instructions at once ("wider") or pipeline more for higher frequencies ("deeper").
Both of these approaches get diminishing returns. Wider chips rely on parallelism being available at the instruction level, which largely isn't there beyond about 3-wide in the best cases and ~1 typically. Deeper chips have power and heat issues (power typically scales quadratically with frequency because of the accompanying voltage increases, while it scales only linearly with core count) and hurt branch-misprediction recovery time.
We build multi-core chips not because we want to, but because we're out of better alternatives.
I know many examples where a GPU is much faster than a CPU. But there exist algorithms (problems) which are very hard to parallelise. Could you give me some examples or tests where a CPU can beat a GPU?
Edit:
Thanks for the suggestions! We can make a comparison between the most popular and the newest CPUs and GPUs, for example a Core i5 2500K vs. a GeForce GTX 560 Ti.
I wonder how to compare the SIMD models between them. For example, CUDA calls its SIMD model SIMT, more precisely. But SIMT should be compared to multithreading on CPUs, which distributes threads (tasks) between MIMD cores (a Core i5 2500K gives us 4 MIMD cores). On the other hand, each of these MIMD cores can implement the SIMD model, but this is something other than SIMT and I don't know how to compare them. Finally, the Fermi architecture with concurrent kernel execution might be considered as MIMD cores running SIMT.
Based on my experience, I will summarize the key differences in performance between parallel programs on CPUs and GPUs. Trust me, the comparison can change from generation to generation. So I will just point out what is good and bad for CPUs and GPUs. Of course, if you write a program at one extreme, i.e. with only the bad or only the good sides, it will definitely run faster on one platform. But a mixture of the two requires very complicated reasoning.
Host program level
One key difference is memory transfer cost. GPU devices require memory transfers between host and device. This cost is non-trivial in some cases, for example when you have to frequently transfer large arrays. In my experience, this cost can be minimized by pushing most of the host code into device code. The only cases where you cannot do so are when you have to interact with the host operating system, such as writing output to the monitor.
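A small sketch of what "pushing host code to device code" buys in practice (the kernels are placeholders of my own invention): the array is copied to the GPU once, stays resident across many kernel launches, and is copied back once at the end.

    __global__ void step_a(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;          // stand-in for real work
    }
    __global__ void step_b(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 0.5f;          // stand-in for real work
    }

    void run(float *h, float *d, int n, int iters) {
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // once
        for (int i = 0; i < iters; ++i) {
            step_a<<<(n + 255) / 256, 256>>>(d, n);   // data stays on the
            step_b<<<(n + 255) / 256, 256>>>(d, n);   // device in between
        }
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // once
    }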
Device program level
Now we come to a complex picture that hasn't been fully revealed yet. What I mean is that there are many details of GPUs that haven't been disclosed. Still, there is a lot that distinguishes CPU and GPU (kernel) code in terms of performance.
There are a few factors that, in my experience, contribute dramatically to the difference.
Workload distribution
GPUs, which consist of many execution units, are designed to handle massively parallel programs. If you have only a little work, say a few sequential tasks, and put those tasks on a GPU, only a few of those many execution units are busy, so it will be slower than a CPU. CPUs, on the other hand, are better at handling short, sequential tasks. The reason is simple: CPUs are much more complicated and able to exploit instruction-level parallelism, whereas GPUs exploit thread-level parallelism. I have heard that the NVIDIA GF104 can do superscalar execution, but I haven't had a chance to experiment with it.
It is worth noting that, on GPUs, the workload is divided into small blocks (or workgroups in OpenCL), and blocks are arranged in chunks, each of which is executed on one streaming multiprocessor (I am using NVIDIA terminology). On CPUs, those blocks are executed sequentially - I can't think of anything other than a single loop.
Thus, programs that have a small number of blocks will likely run faster on CPUs.
Control flow instructions
Branches are bad for GPUs, always. Please bear in mind that GPUs prefer uniform things: equal blocks, equal threads within a block, and equal threads within a warp. But what matters the most?
Branch divergence.
CUDA/OpenCL programmers hate branch divergence. All the threads are divided into sets of 32 threads, called warps, and all threads within a warp execute in lockstep, so a branch divergence causes some threads in the warp to be serialized. The execution time of the warp is multiplied accordingly.
Unlike on GPUs, each core in a CPU can follow its own path. Furthermore, branches can be executed efficiently because CPUs have branch prediction.
Thus, programs with more warp divergence are likely to run faster on CPUs.
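A toy illustration of warp divergence (my own, not from this answer): both kernels contain an if, but only the first one splits threads within a warp and therefore serializes the two paths.

    // Divergent: within each 32-thread warp, odd and even threads take
    // different branches, so the two paths execute one after the other.
    __global__ void divergent(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0) out[i] = expf(out[i]);   // half the warp idles here
        else            out[i] = logf(out[i]);   // ...and the other half here
    }

    // Uniform: whole warps (threads 0-31, 32-63, ...) take the same branch,
    // so no serialization happens even though the code still contains an if.
    __global__ void uniform(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((i / 32) % 2 == 0) out[i] = expf(out[i]);
        else                   out[i] = logf(out[i]);
    }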
Memory access instructions
This really is complicated, so let's keep it brief.
Remember that global memory accesses have very high latency (400-800 cycles). So in older generations of GPUs, whether memory accesses were coalesced was a critical matter. Now your GTX 560 (Fermi) has two more levels of cache, so the global memory access cost can be reduced in many cases. However, the caches in CPUs and GPUs are different, so their effects are also different.
What I can say is that it really depends on your memory access pattern and your kernel code pattern (how memory accesses are interleaved with computation, the types of operations, etc.) whether one runs faster on GPUs or CPUs.
But you can expect that a huge number of cache misses (on GPUs) has a very bad effect on GPU performance (how bad depends on your code).
Additionally, shared memory is an important feature of GPUs. Accessing shared memory is as fast as accessing the GPU L1 cache, so kernels that make use of shared memory benefit considerably.
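As a minimal example of using shared memory (my sketch, not anything specific from this answer), a per-block sum stages each block's slice in on-chip shared memory and reduces it there:

    // Each block stages its slice of the input in shared memory and reduces
    // it there, touching global memory only once per element plus one result
    // per block. Launch with 256 threads per block.
    __global__ void block_sum(const float *in, float *block_results, int n) {
        __shared__ float tile[256];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        tile[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) tile[tid] += tile[tid + stride];
            __syncthreads();
        }
        if (tid == 0) block_results[blockIdx.x] = tile[0];
    }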
There are other factors I haven't really mentioned that can have a big impact on performance in many cases, such as bank conflicts, memory transaction size, and GPU occupancy...