Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 9 years ago.
Improve this question
Simple question, increasing core directly related to performance?
My understanding (kindly correct me if i am wrong) is in multi-core systems, communication overhead and memory
latencies are a limiting factor in performance as compared to single core. Perhaps a single core system with large L1 and L2 cache can perform much better then Core 2 Duos? But then why in almost every new architecture number of cores are being increased. There must be reason which i am here to know.
Thanks for help!
Generally memory latency nor bandwidth are an issues when scaling up the # of cores in a system. Note: There are probably specialized exceptions, but by and large most modern systems don't start running into memory bottlenecks until 6+ hardware cores are accessing memory.
Communication overhead, however, can be devastatingly expensive. The technical reasons for this are extremely complicated and beyond the scope of my answer -- some aspects are related to hardware but others are simply related to the cost of one core blocking for another to finish its calculations .. both are bad. Because of this, programs/applications that utilize multiple cores typically must do so with as little communication between cores as possible. This limits the types of tasks that can off-loaded onto separate cores.
New systems are adding more cores simply because it is technologically feasible. Eg, increasing single core performance is neither technically nor economically viable anymore. Almost all applications programmers I know would absolutely prefer a single ultra-fast core over having to figure out how to efficiently utilize 12 cores. But the chip manufacturers couldn't produce such a core even if you granted them tens of millions of dollars.
As long as the speed of light is a fixed constant, parallel processing will be here to stay. As it is today, much of the speed improvement found in CPUs is due to parallel processing of individual instructions. As much as is possible, a Core 2 Duo (for example) will run up to four instructions in parallel. This works because in many programs sequences of instructions are often not immediately dependent on each other:
a = g_Var1 + 1;
b = g_Var2 + 3;
c = b * a;
d = g_Var3 + 5;
Modern CPUs will actually execute lines 1,2, AND 4 in parallel, and then double back and finish up line 3 -- usually in parallel with whatever comes in lines 5,6,etc. (assuming the 'c' variable result isn't needed in any of them). This is needed because our ability to speed up or shorten the pipeline for executing any single instruction is very limited. So instead engineers have been focusing on "going wide" -- more instructions in parallel, more cores in parallel, more computers in parallel (the last being similar to cloud computing, BOINC or #home projects).
It depends on your software. If you have CPU intensive calculation tasks that don't use much external communication and require parallel processing - multi-core is the way to go to scale vertically. It will perform much better to compare with single core CPU...since it can perform calculation tasks in parallel (again depends on your particular task(s) that take advantage of paralleled execution). For example DB servers usually take advantage of parallel processing and scale greatly on multi-core CPUs.
Once vertical limit exhausted, you can scale horizontally by introducing multiple nodes in your cluster and you would need to coordinate task execution.
So to your question:
But then why in almost every new architecture number of cores are
being increased.
One of the reasons is that software evolves to take advantage of parallel processing and hardware trying to satisfy this hunger.
You're assuming that cores can become usefully more complex. At this point, that's not a safe assumption.
You can either execute more instructions at once ("wider") or pipeline more for higher frequencies ("deeper").
Both of these approaches get diminishing returns. Wider chips rely on parallelism being available at the instruction level, which it largely isn't beyond about 3-wide in the best cases and ~1 typically. Deeper chips have power and heat issues (power typically scales quadraticaly with frequency due to voltage increases, while scaling linearly with core count) and hurt branch mispredict recovery time.
We do multi core chips not because we want to, but because we're out of better alternatives.
Related
Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 6 months ago.
Improve this question
CPUs are still "improving", but their frequency haven't improved a lot during the last 10 years.
I can understand that the transistor count increases because of smaller and smaller transistors, but I don't understand how a non-parallel program (most programs are non-parallel, I think?) can be executed much faster on new CPUs if the frequency doesn't increase.
I can understand why GPU can be faster with more transistors because they're parallel processors (is that the right term?) and they only execute parallel code.
But most software is non parallel, so to me it seems that new CPU should not become much faster than previous CPUs, unless most programs can be parallelized, which is not the case (I'm not sure, but what are typical algorithms that can't be parallelized?).
Are larger L1/L2/L3 cache sizes allowing new CPU to be faster? Or are there other things like new instructions or branching things?
What am I missing?
More and more programs are using threads for things that can be parallelized reasonably. But you're asking about single-threaded (or per-core) performance, and that's totally fine and interesting.
You're missing instruction-level parallelism (ILP) and increasing IPC (instructions per cycle).
Also, SIMD (x86's SSE, AVX, AVX-512, or ARM NEON, SVE) to get more work done per instruction, exploiting data-parallelism in a (potentially small) loop that way instead of or as well as with threading. But that isn't a big factor for many applications.
Work per clock is instructions/cycle x work/insn x threads. (threads is basically number of clocks ticking at once, if your program is running on multiple cores). Even if threads is 1, the other two factors can increase.
A problem with lots of data parallelism (e.g. summing an array, or adding 1 to every element) can expose that parallelism to the CPU in three ways: SIMD, instruction-level parallelism (e.g. unroll with multiple accumulators if there's a dependency chain like a sum), and thread-level parallelism.
These are all orthogonal. And some of them apply to problem that aren't data parallel, just different steps of a complicated program. IPC applies all the time. With good enough branch prediction, CPUs can see far ahead in the instruction stream and finding parallel work to do (especially memory-level parallelism), as long as the code isn't doing something like traverse a linked list where the next load address depends on the current load result. Then you bottleneck on load latency, with no memory-level parallelism (except for whatever work you're doing on each node.)
Some major factors
Larger caches improve hit rates and effective bandwidth, leading to fewer stalls. That raises average IPC. (Also smarter cache-replacement algorithms, like L3 adaptive replacement in Ivy Bridge.)
Actual DRAM bandwidth increases help, too, (especially with good HW prefetching), but DRAM B/W is shared between cores. L1/L2 cache are private in modern CPUs, and L3 bandwidth scales nicely as well with different cores accessing different parts of it. Still, DRAM often comes into play, especially in code that isn't carefully tuned for cache-blocking. DRAM latency is near constant (in absolute nanoseconds, so getting "worse" in core clock cycles), but memory clocks have been climbing significantly in the past decade.
Larger ReOrder Buffers (ROB) and schedulers (RS) allow CPUs to find ILP over larger windows. Similarly, larger load and store buffers allow more memory-level parallelism, e.g. tracking more in-flight cache-miss loads in parallel. And having a larger buffer before you have to stall if a store misses in cache.
Better branch prediction reduces how often this speculative work has to be discarded if the CPU finds it had guessed the wrong path for an earlier branch.
Wider pipelines allow higher peak IPC. At best, in high-throughput code (not a lot of stalls, and lots of ILP), this can be sustained.
Otherwise, it at least helps get to the next stall sooner, doing a burst of work. And to clear out instructions waiting in the ROB when a cache-miss load does finally arrive, making room for new work to potentially see some independent work later. If execution of a loop condition can get far ahead of the actual work in the loop, a mispredict of the loop exit branch might be resolved before the back-end runs out of work to do. So a max IPC higher than the steady-state bottleneck of a loop is useful for loops that aren't infinite.
See also
Modern Microprocessors A 90-Minute Guide! - covers this quite well, how clocks mostly stopped increasing steadily once we hit the "power wall" in Pentium 4. As ever more efficient CPUs are designed, clocks are creeping up again, especially with fine-grained clock gating to stop heat generation from parts of a CPU that aren't doing anything in any given clock cycle.
This allows high turbo clocks when running "inefficient" code that bottlenecks on cache misses, branch misses, and other things like that so there aren't a lot of execution units busy at once.
How does a single thread run on multiple cores? - it doesn't, that's not what IPC is about. My answer there attempts to explain it so a beginner can understand.
https://www.realworldtech.com/sandy-bridge/ A deep dive on Sandybridge, how it finds instruction-level parallelism.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? (much more technical) - what to look for when predicting how fast a CPU might run a given loop or sequence of assembly instructions.
My question primarily applies to firestorm/icestorm (because that's the hardware I have), but I am curious about what other representative arm cores do too. Arm has strange pre- and post-incremented addressing modes. If I have (for instance) two post-incremented loads from the same register, will the second depend on the first, or is the CPU smart enough to perform them in parallel?
AFAIK the exact behaviour of the M1 execution units is mainly undocumented. Still, there is certainly a dependency chain in this case. In fact, it would be very hard to break it and the design of modern processors make this even harder: the decoders, execution units, schedulers are distinct units and it would be insane to dynamically adapt the scheduling based on the instructions executed in parallel by execution units so to be able to break the chain in this particular case. Not to mention that instructions are pipelined and it generally takes few cycles for them to be committed. Furthermore, the time of the instructions is variable based on the fetched memory location. Finally, even this would be the case, the Firestorm documents does not mention such a feedback loop (see below for the links). Another possible solution for a processor to optimize such a pattern is to fuse the microinstructions so to combine the increment and add more parallelism but this is pretty complex to do for a relatively small improvement and there is no evidence showing Firestorm can do that so far (see here for more information about Firestorm instruction fusion/elimitation).
The M1 big cores (Apple's Firestorm) are designed to be massively parallel. They have 6 ALUs per core so they can execute a lot instructions in parallel on each core (possibly at the expense of a higher latency). However, this design tends to require a lot more transistors than current mainstream x86 Intel/AMD alternative (Alderlake/XX-Cove architecture put aside). Thus, the cores operate at a significantly lower frequency so to keep the energy consumption low. This means dependency chains are significantly more expensive on such an architecture compared to others unless there are enough independent instructions to be execute in parallel on the critical path. For more information about how CPUs works please thread Modern Microprocessors - A 90-Minute Guide!. For more information about the M1 processors and especially the Firestorm architecture, please read this deep analysis.
Note that Icestorm cores are designed to be energy efficient so they are far less parallel and thus having a dependency chain should be less critical on such a core. Still, having less dependency is often a good idea.
As for other ARM processors, recent core architecture are not as parallel as Firestorm. For example, the Cortex-A77 and Neoverse V1 have "only" 4 ALUs (which is already quite good). One need to also care about the latency of each instruction actually used in a given code. This information is available on the ARM website and AFAIK not yet published for Apple processors (one need to benchmark the instructions).
As for the pre VS post increment, I expect them to take the same time (same latency and throughput), especially on big cores like Firestorm (that try to reduce the latency of most frequent instruction at the expense of more transistors). However, the actual scheduling of the instruction for a given code can cause one to be slower than the other if the latency is not hidden by other instructions.
I received an answer to this on IRC: such usage will be fairly fast (makes sense when you consider it corresponds to typical looping patterns; good if the loop-carried dependency doesn't hurt too much), but it is still better to avoid it if possible, as it takes up rename bandwidth.
I know that the usual method when we want to make a big math computation faster is to use multiprocessing / parallel processing: we split the job in for example 4 parts, and we let 4 CPU cores run in parallel (parallelization). This is possible for example in Python with multiprocessing module: on a 4-core CPU, it would allow to use 100% of the processing power of the computer instead of only 25% for a single-process job.
But let's say we want to make faster a non-easily-splittable computation job.
Example: we are given a number generator function generate(n) that takes the previously-generated number as input, and "it is said to have 10^20 as period". We want to check this assertion with the following pseudo-code:
a = 17
for i = 1..10^20
a = generate(a)
check if a == 17
Instead of having a computer's 4 CPU cores (3.3 Ghz) running "in parallel" with a total of 4 processes, is it possible to emulate one very fast single-core CPU of 13.2 Ghz (4*3.3) running one single process with the previous code?
Is such technique available for a desktop computer? If not, is it available on cloud computing platforms (AWS EC2, etc.)?
Single-threaded performance is extremely valuable; it's much easier to write sequential code than to explicitly expose thread-level parallelism.
If there was an easy and efficient general-purpose way to do what you're asking which works when there is no parallelism in the code, it would already be in widespread use. Either internally inside multi-core CPUs, or in software if it required higher-level / larger-scale code transformations.
Out-of-order CPUs can find and exploit instruction-level parallelism within a single thread (over short distances, like a couple hundred instructions), but you need explicit thread-level parallelism to take advantage of multiple cores.
This is similar to How does a single thread run on multiple cores? over on SoftwareEnginnering.SE, except that you've already ruled out any easy-to-find parallelism including instruction-level parallelism. (And the answer is: it doesn't. It's the hardware of a single core that finds the instruction-level parallelism in a single thread; my answer there explains some of the microarchitectural details of how that works.)
The reverse process: turning one big CPU into multiple weaker CPUs does exist, and is useful for running multiple threads which don't have much instruction-level parallelism. It's called SMT (Simultaneous MultiThreading). You've probably heard of Intel's Hyperthreading, the most widely known implementation of SMT. It trades single-threaded performance for more throughput, keeping more execution units fed with useful work more of the time. The cost of building a single wide core grows at least quadratically, which is why typical desktop CPUs don't just have a single massive core with 8-way SMT. (And note that a really wide CPU still wouldn't help with a totally dependent instruction stream, unless the generate function has some internal instruction-level parallelism.)
SMT would be good if you wanted to test 8 different generate() functions at once on a quad-core CPU. Without SMT, you could alternate in software between two generate chains in one thread, so out-of-order execution could be working on instructions from both dependency chains in parallel.
Auto-parallelization by compilers at compile time is possible for source that has some visible parallelism, but if generate(a) isn't "separable" (not the correct technical term, I think) then you're out of luck.
e.g. if it's return a + hidden_array[static_counter++]; then the compiler can use math to prove that summing chunks of the array in parallel and adding the partial sums will still give the same result.
But if there's truly a serial dependency through a (like even a simple LCG PRNG), and the software doesn't know any mathematical tricks to break the dependency or reduce it to a closed form, you're out of luck. Compilers do know tricks like sum(0..n) = n*(n+1)/2 (evaluated slightly differently to avoid integer overflow in a partial result), or a+a+a+... (n times) is a * n, but that doesn't help here.
There is a scheme studied mostly in the academy called "Thread Decomposition". It aims to do more or less what you ask about - given a single-threaded code, it tries to break it down into multiple threads in order to divide the work on a multicore system. This process can be done by a compiler (although this requires figuring out all possible side effects at compile time which is very hard), by a JIT runtime, or through HW binary-translation, but each of these methods has complicated limitations and drawbacks.
Unfortunately, other than being automated, this process has very little appeal as it can hardly match true manual parallelization done by a person how understands the code. It also doesn't simply scale performance according to the number of threads, since it usually incurs a large overhead in the form of code that has to be duplicated.
Example paper by some nice folks from UPC in Barcelona: http://ieeexplore.ieee.org/abstract/document/5260571/
I am tunning my GEMM code and comparing with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of threads from OpenMP (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000) I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can be also seen in this plot
https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba
The difference is quite substantial. However, the performance is much less stable. The performance jumps around quit a bit each iteration both with Eigen and my own GEMM code. I'm surprised that Hyperthreading makes the performance so much worse. I guess this is not not a question. It's an unexpected observation which I'm hoping to find feedback on.
I see that not using hyper threading is also suggested here.
How to speed up Eigen library's matrix product?
I do have a question regarding measuring max performance. What I do now is run CPUz and look at the frequency as I'm running my GEMM code and then use that number in my code (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How to I properly account for turbo boost?
The purpose of hyperthreading is to improve CPU usage for code exhibiting high latency. Hyperthreading masks this latency by treating two threads at once thus having more instruction level parallelism.
However, a well written matrix product kernel exhibits an excellent instruction level parallelism and thus exploits nearly 100% of the CPU ressources. Therefore there is no room for a second "hyper" thread, and the overhead of its management can only decrease the overall performance.
Unless I've missed something, always possible, your CPU has one clock shared by all its components so if you measure it's rate at 4.3GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, some cores running at one rate, others at another rate; the shared components (eg memory access) would become unmanageable.
As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor-person's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along pushing your n*10^6 contiguous memory locations through the FPUs a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles, at worst all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.
Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.
As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more that the actual physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work, is likely unaware that it is doing so, relying on merely the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT may provide, is that you can max out all physical processors, without bringing the rest of the system to a crawl. Without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads, they are not additional processors.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
I know many examples when GPU is much faster than CPU. But exists algorithms (problems) which are very hard to parallelise. Could you give me some examples or tests when CPU can overcome GPU ?
Edit:
Thanks for suggestions! We can make a comparison between the most popular and the newest cpu's and gpu's, for example Core i5 2500k vs GeForce GTX 560 Ti.
I wonder how to compare SIMD model between them. For example: Cuda calls a SIMD model more precisely SIMT. But SIMT should be compared to the multhitreading on CPU's which is distributing threads (tasks) between MIMD cores (Core i5 2500k give as 4 MIMD cores). On the other hand each of these MIMD cores can implement SIMD model, but this is something else than SIMT and I don't know how to compare them. Finally a fermi architecture with concurrent kernel execution might be consider as MIMD cores with SIMT.
Based on my experience, I will summarize the key differences in terms of performance between parallel programs in CPUs and GPUs. Trust me, a comparison can be changed from generation to generation. So I will just point out what is good and is bad for CPUs and GPUs. Of course, if you make a program at the extreme, i.e., having only bad or good sides, it will run definitely faster on one platform. But a mixture of those requires very complicated reasoning.
Host program level
One key difference is memory transfer cost. GPU devices requires some memory transfers. This cost is non-trivial in some cases, for example when you have to frequently transfer some big arrays. In my experience, this cost can be minimized but pushing most of host code to device code. The only cases you can do so are when you have to interact with the host operating system in program, such as outputting to monitor.
Device program level
Now we come to see a complex picture that hasn't been fully revealed yet. What I mean is there are many mysterious scenes in GPUs that haven't been disclosed. But still, we have a lot of distinguish CPU and GPU (kernel code) in terms of performance.
There are few factors that I noticed those dramatically contribute to the difference.
Workload distribution
GPUs, which consist of many execution units, are designed to handle massively parallel programs. If you have little of work, say a few sequential tasks, and put these tasks on a GPU, only a few of those many execution units are busy, thus will be slower than CPU. Because CPUs are, in other hand, better to handle short and sequential tasks. The reason is simple, CPUs are much more complicated and able to exploit instruction level parallelism, whereas GPUs exploit thread level parallelism. Well, I heard NVIDIA GF104 can do Superscalar, but I had no chance to experience with it though.
It is worth noting that, in GPUs, workload are divided into small blocks (or workgroups in OpenCL), and blocks are arranged in chunks, each of which is executed in one Streaming processor (I am using terminologies from NVIDIA). But in CPUs, those blocks are executed sequentially - I can't think of anything else than a single loop.
Thus, for programs that have small number of blocks, it will be likely to run faster on CPUs.
Control flow instructions
Branches are bad things to GPUs, always. Please bear in mind that GPUs prefer equal things. Equal blocks, equal threads within a blocks, and equal threads within a warp. But what matters the most?
***Branch divergences.***
Cuda/OpenCL programmers hate branch divergences. Since all the threads somehow are divided into sets of 32 threads, called a warp, and all threads within a warp execute in lockstep, a branch divergence will cause some threads in the warp to be serialized. Thus, the execution time of the warp will be accordingly multiplied.
Unlike GPUs, each cores in CPUs can follow their own path. Furthermore, branches can be efficiently executed because CPUs have branch prediction.
Thus, programs that have more warp divergences are likely to run faster on CPUs.
Memory access instructions
This REALLY is complicated enough so let's make it brief.
Remember that global memory accesses have very high latency (400-800 cycles). So in old generations of GPUs, whether memory accesses are coalesced was a critical matter. Now your GTX560 (Fermi) has more 2 level of caches. So global memory access cost can be reduced in many cases. However, caches in CPUs and GPUs are different, so their effects are also different.
What I can say is that it really really depends on your memory access pattern, your kernel code pattern (how memory accesses are interleaved with computation, the types of operations, etc., ) to tell if one runs faster on GPUs or CPUs.
But somehow you can expect a huge number of cache misses (in GPUs) has a very bad effect on GPUs (how bad? - it depends on your code).
Additionally, shared memory is an important feature of GPUs. Accessing to shared memory is as fast as accessing to GPU L1 cache. So kernels that make use of shared memory will have pretty much benefit.
Some other factors I haven't really mentioned but those can have big impact on the performance in many cases such as bank conflicts, size of memory transaction, GPU occupancy...