Is a multi-core CPU required to implement SIMD?
I found the following phrase "multiple processing elements" when reading Wikipedia about SIMD. So what's the difference between this phrase and "multi-core CPU"?
Every core has its own independent SIMD execution units. Using SIMD instructions in one core doesn't cost execution resources in other cores. Separate cores even on the same physical chip are independent so they can go to sleep separately to save power, and various other design reasons for keeping them isolated.
One exception that I'm aware of: AMD Bulldozer has two weak integer cores sharing a SIMD / FPU and sharing some cache. They call this a "cluster", and it's basically an alternative to Hyperthreading (SMT). See David Kanter's Bulldozer write-up on RealworldTech.
SIMD and multi-core are orthogonal: you can have multi-core without SIMD (maybe some ARM chips without an FPU / NEON), and you can have SIMD without multi-core.
There are many examples of the latter, most prominently early x86 chips like Pentium-MMX through Pentium III / Pentium 4, which had MMX / SSE1 / SSE2 but were single-core CPUs.
There are at least three different kinds of parallelism in programs:
Instruction-level parallelism: it's possible to overlap some of the work done by different instructions within the same single thread of execution, preserving the illusion of running every instruction one after another. Exploit it by building a pipelined CPU core, or superscalar (multiple instructions per clock), or even out-of-order execution. (See my answer on a question about that for details.)
When creating software: Expose this parallelism to the hardware by avoiding long dependency chains whenever possible. (e.g. replace sum += a[i++] with sum1+=a[i]; sum2+=a[i+1]; i+=2;: unroll with multiple accumulators). Or use arrays instead of linked lists, because the next address to load is computed cheaply, instead of being part of the data from memory you have to wait for on a cache miss. But mostly ILP is already there in "normal" code without doing anything special, and you build bigger / fancier hardware to find more of it, and increase the average instructions-per-clock.
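As a concrete sketch of that unroll (not from the original answer; the array a and length n are hypothetical inputs), the two partial sums form independent dependency chains that the CPU can overlap:

#include <stddef.h>

// Sketch: two independent add chains instead of one loop-carried chain.
// Assumes n is even, for brevity.
float sum_two_accumulators(const float *a, size_t n) {
    float sum1 = 0.0f, sum2 = 0.0f;
    for (size_t i = 0; i < n; i += 2) {
        sum1 += a[i];        // chain 1
        sum2 += a[i + 1];    // chain 2: independent of chain 1, can overlap with it
    }
    return sum1 + sum2;      // combine the accumulators once at the end
}

(Reassociating FP adds like this changes rounding slightly, which is why compilers only do this kind of unroll with multiple accumulators on their own when -ffast-math allows it.)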
Data parallelism: you need to do the same thing to every pixel of an image, or every sample in an audio file. (e.g. blend 2 images, or mix two audio streams). Exploit this by building parallel execution units into each CPU core so a single instruction can do 16 single-byte additions in parallel, giving you increased throughput with no increase in the amount of instructions you need to get through the CPU core per clock. This is SIMD: Single Instruction, Multiple Data.
Audio / video are the most well-known applications of this, where the speedups are massive because you can fit a lot of byte or 16-bit elements into a single fixed-width vector register.
Exploit SIMD by auto-vectorizing loops with smart compilers, or manually. SIMD turns sum += a[i]; into sum[0..3] += a[i+0..3] (for 4 elements per vector, like int or float 32-bit elements in a 128-bit vector).
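For illustration, here's a minimal sketch of that mapping with SSE intrinsics (a hypothetical function, not from the original answer; assumes x86 with SSE and that n is a multiple of 4):

#include <immintrin.h>
#include <stddef.h>

// Sketch: 4 float adds per instruction, i.e. sum[0..3] += a[i+0..3].
float sum_sse(const float *a, size_t n) {
    __m128 vsum = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_loadu_ps(a + i));   // 4 lanes added in parallel
    // horizontal sum of the 4 lanes at the end
    __m128 hi = _mm_movehl_ps(vsum, vsum);              // lanes 2,3
    vsum = _mm_add_ps(vsum, hi);                        // lanes 0+2, 1+3
    vsum = _mm_add_ss(vsum, _mm_shuffle_ps(vsum, vsum, 1));
    return _mm_cvtss_f32(vsum);
}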
Thread/task-level parallelism: exploit with multi-core CPUs, expose to the hardware by manually writing multi-threaded code, or using OpenMP or other auto-parallelization tools to multi-thread a loop, or use a library function that starts multiple threads for a big matrix multiply or something.
Or more simply by running multiple separate programs at once. e.g. compile with make -j8 to keep 8 compile processes in flight at once. Coarse-grained task-level parallelism can also be exploited by running your workload on a cluster of multiple computers, or even distributed computing.
But multi-core CPUs make it possible / efficient to exploit fine-grained thread-level parallelism where tasks need to share lots of data (like a large array), or have low latency communication through shared memory. (e.g. with locks to protect different parts of shared data, or lockless programming.)
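As a minimal sketch of exploiting that with OpenMP (a hypothetical function, not from the original answer; compile with -fopenmp), each thread sums a chunk of the shared array and the reduction clause combines the per-thread partial sums:

#include <stddef.h>

// Sketch: thread-level parallelism over a shared array with OpenMP.
float sum_parallel(const float *a, ptrdiff_t n) {
    float total = 0.0f;
    #pragma omp parallel for reduction(+:total)
    for (ptrdiff_t i = 0; i < n; ++i)
        total += a[i];       // each thread accumulates into its own private copy of total
    return total;
}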
These three kinds of parallelism are orthogonal.
To sum a very large array of float on a modern CPU:
You'd start one thread per CPU core, and have each core loop over a chunk of the array in shared memory. (Thread-level parallelism). This gives you a factor of 4 speedup, let's say. (Even that's maybe unrealistic because of memory bottlenecks, but you can imagine some other computationally intensive task that didn't require reading so much memory, running on a 28-core Xeon, or a dual-socket server with two of those chips...)
The code for each thread would use SIMD to do 4 or 8 adds per instruction, on each core separately. (SIMD). This gives you a factor of 4 or 8 speedup. (Or 16 with AVX512)
You'd unroll with let's say 8 vector accumulators to hide the latency of floating-point add. (ILP). Skylake's vaddps instruction has a latency of 4 cycles and a throughput of 0.5 cycles (i.e. 2 per clock). So 8 accumulators is just barely enough to hide that latency and keep 8 FP add instructions in flight at once.
The total throughput gain over single-threaded scalar sum += a[i++] is the product of all those speedup factors: 4 * 8 * 8 = 256x the throughput of a non-parallelized, non-vectorized, single-accumulator ILP-bottlenecked naive implementation like you'd get from gcc -O2 for a simple loop. clang -O3 -march=native -ffast-math would give SIMD, and some ILP (because clang knows how to use multiple accumulators when unrolling, often using 4, unlike gcc.)
You'd need OpenMP or other auto-parallelization to exploit multiple cores.
Related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more in-depth look at multiple accumulators for ILP, and SIMD, for an FMA loop.
No. Each core can normally perform most general operations from the instruction set independently. The "multiple processing elements" for SIMD operations just perform a single operation on different data (different bytes or words).
For example, each core of the ARM Cortex-A53 microarchitecture can run SIMD instructions independently of the other cores, while SIMD instruction sets such as MMX, SSE and SSE2 were first introduced on single-core CPUs.
Yes, it does, but only from the marketing point of view. It would be difficult to sell a microprocessor (uP) or microcontroller (uC) with no SIMD instructions.
Related
Can a CUDA INT32 core process two different integer instructions completely in parallel, without context switching? I know that it is not possible on a CPU, but what about an NVIDIA GPU? I know that an SM can run warps, and if a core has to wait for some information, it gets another thread from the dispatch unit.
I know that it is not possible on a CPU, but on a NVIDIA GPU?
This assertion is wrong on modern mainstream CPUs (e.g. for at least a decade on nearly all x86-64 processors, starting from Intel Skylake or AMD Zen 2). Indeed, modern x86-64 Intel/AMD processors can generally compute 2 (256-bit AVX) SIMD vectors in parallel since there are generally 2 SIMD units. Processors like Intel Skylake also have 4 ALU units capable of computing 4 basic arithmetic operations (e.g. add, sub, and, xor) in parallel per cycle. Some instructions like division are far more expensive and do not run in parallel on such architectures, though division is well pipelined. The instructions can come from the same thread on the same logical core, or from 2 threads (possibly of 2 different processes) scheduled on 2 logical cores, without any context switches. Note that recent high-end ARM processors can also do this (even some mobile processors).
Can a CUDA INT32 Core process two different integer instructions completelly parallel, without context switching?
NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Thus, 1 instruction operates on 32 items in parallel (though, theoretically, the hardware is free not to do that completely in parallel). A kernel execution basically contains many blocks, and blocks are scheduled onto SMs. An SM can operate on many blocks concurrently, so there is a massive amount of parallelism available.
Whether a specific GPU can execute two INT32 warps in parallel depends on the target architecture, not on CUDA itself. On modern Nvidia GPUs, each SM can be split into multiple partitions that can each execute instructions on blocks independently of the other partitions. For example, AFAIK, on a Pascal GP104 there are 20 SMs, and each SM has 4 partitions capable of running SIMD instructions operating on 1 warp (32 items) at a time. In practice, things can be a bit more complex on newer architectures. You can get more information here.
On a computer with an Intel CPU marketed as "6 cores / 12 threads", I want to run as many processes as possible, each of them doing similar math computations (each process has a single thread) with different input data. There is no GPU involved, and no inter-process communication is needed.
What is the optimal number of parallel processes of the same executable doing math computations?
Should I run 6 processes (one per physical core)? Or 12 processes (one per thread / virtual core)?
If one process does, say, 1000 computations per second, I'm pretty sure that running 6 of them will run at ~1000/sec each (so a total of ~6000/sec).
But won't running 12 processes make them only 500 computations per second each?
TL;DR: should I run one process per "core" or one process per "thread" on a "6 cores/12 threads Intel CPU"?
It depends very much on the actual computing code. Some applications can benefit from hyper-threading while others do not. High-performance applications rarely benefit from hyper-threading, so using 1 process per core is certainly the best configuration, assuming the code is compute-bound and scales well.
Multiple hyper-threads of recent Intel processors (e.g. Skylake/Ice Lake) can share some execution ports. As a result, the overall execution can be faster if one process alone is not able to saturate the ports. In practice, this is a bit more complex (modern processors are very complex), since compute-bound processes can be bound by other parts of the processor, like instruction decoding or trickier low-level units.
For example, the following C code should benefit from hyper-threading (assuming no fast-math optimizations are applied and the code is compiled with optimizations):
float sum = 0.f;
for(int i=0 ; i<maxi ; ++i)
    sum += array[i];   // each add depends on the previous one (loop-carried dependency)
Indeed, the latency of a floating-point addition instruction is 3 to 4 cycles, while generally 2 of them can be executed per cycle (only 1 before Skylake). This means the code is bound by the latency of the addition dependency chain. Hyper-threads can use the otherwise-idle execution port during this time, resulting in up to twice faster execution (other bottlenecks cause the execution not to be that much faster in practice). If the code is optimized with fast-math, then compilers can unroll the loop and make use of instruction-level parallelism (ILP). A low IPC (instructions per cycle) often means that using hyper-threads may be beneficial, especially if the cause of the low IPC is latency issues (e.g. instruction latency and cache misses). Unfortunately, this is not always true. For example, the following code should not be faster with hyper-threading:
for(int i=0 ; i<maxi ; ++i)
    out_array[i] += in_array[i];   // one load + one store per iteration: the store port is the bottleneck
This is because there is generally 1 store execution port on Intel processors, and it should already be saturated by 1 hyper-thread (otherwise the loop is bound by memory throughput, which is no better with hyper-threading). Thus, using more hyper-threads should not improve the execution time. In fact, hyper-threading introduces a slight overhead that should cause slightly slower execution.
The thing is, applications are generally much more complex than that, and one does not know how math functions are implemented. As a result, it is nearly impossible for a developer to know the best configuration without a basic benchmark, unless the computing kernel is simple.
I think I have a decent understanding of the difference between latency and throughput, in general. However, the implications of latency on instruction throughput are unclear to me for Intel Intrinsics, particularly when using multiple intrinsic calls sequentially (or nearly sequentially).
For example, let's consider:
_mm_cmpestrc
This has a latency of 11 and a throughput of 7 on a Haswell processor. If I ran this instruction in a loop, would I get continuous per-cycle output after 11 cycles? Since this would require 11 instructions to be running at a time, and since I have a throughput of 7, do I run out of "execution units"?
I am not sure how to use latency and throughput other than to get an impression of how long a single instruction will take relative to a different version of the code.
For a much more complete picture of CPU performance, see Agner Fog's microarchitecture guide and instruction tables. (Also his Optimizing C++ and Optimizing Assembly guides are excellent). See also other links in the x86 tag wiki, especially Intel's optimization manual.
See also
https://uops.info/ for accurate tables collected programmatically from microbenchmarks, so they're free from editing errors like Agner's tables sometimes have.
How many CPU cycles are needed for each assembly instruction?
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details about using instruction-cost numbers.
What is the efficient way to count set bits at a position or lower? for an example of analyzing short sequences of asm in terms of front-end uops, back-end ports, and latency.
Modern Microprocessors: A 90-Minute Guide! — a very good intro to the basics of CPU pipelines and HW design constraints like power.
Latency and throughput for a single instruction are not actually enough to get a useful picture for a loop that uses a mix of vector instructions. Those numbers don't tell you which intrinsics (asm instructions) compete with each other for throughput resources (i.e. whether they need the same execution port or not). They're only sufficient for super-simple loops that e.g. load / do one thing / store, or e.g. sum an array with _mm_add_ps or _mm_add_epi32.
You can use multiple accumulators to get more instruction-level parallelism, but you're still only using one intrinsic so you do have enough information to see that e.g. CPUs before Skylake can only sustain a throughput of one _mm_add_ps per clock, while SKL can start two per clock cycle (reciprocal throughput of one per 0.5c). It can run ADDPS on both its fully-pipelined FMA execution units, instead of having a single dedicated FP-add unit, hence the better throughput but worse latency than Haswell (3c lat, one per 1c tput).
Since _mm_add_ps has a latency of 4 cycles on Skylake, that means 8 vector-FP add operations can be in flight at once. So you need 8 independent vector accumulators (which you add to each other at the end) to expose that much parallelism. (e.g. manually unroll your loop with 8 separate __m256 sum0, sum1, ... variables. Compiler-driven unrolling (compile with -funroll-loops -ffast-math) will often use the same register, but loop overhead wasn't the problem).
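Here's a sketch of that manual unroll (assuming AVX; the array a is hypothetical, and its length n is assumed to be a multiple of 64 to keep the cleanup loop out of the example):

#include <immintrin.h>
#include <stddef.h>

// Sketch: 8 independent vector accumulators keep 8 vector-FP adds in flight,
// enough to hide 4-cycle add latency when 2 adds can start per clock.
float sum_avx_8acc(const float *a, size_t n) {        // assumes n % 64 == 0
    __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps();
    __m256 s2 = _mm256_setzero_ps(), s3 = _mm256_setzero_ps();
    __m256 s4 = _mm256_setzero_ps(), s5 = _mm256_setzero_ps();
    __m256 s6 = _mm256_setzero_ps(), s7 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 64) {              // 8 vectors * 8 floats each
        s0 = _mm256_add_ps(s0, _mm256_loadu_ps(a + i));
        s1 = _mm256_add_ps(s1, _mm256_loadu_ps(a + i + 8));
        s2 = _mm256_add_ps(s2, _mm256_loadu_ps(a + i + 16));
        s3 = _mm256_add_ps(s3, _mm256_loadu_ps(a + i + 24));
        s4 = _mm256_add_ps(s4, _mm256_loadu_ps(a + i + 32));
        s5 = _mm256_add_ps(s5, _mm256_loadu_ps(a + i + 40));
        s6 = _mm256_add_ps(s6, _mm256_loadu_ps(a + i + 48));
        s7 = _mm256_add_ps(s7, _mm256_loadu_ps(a + i + 56));
    }
    // add the accumulators to each other at the end
    __m256 s = _mm256_add_ps(_mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3)),
                             _mm256_add_ps(_mm256_add_ps(s4, s5), _mm256_add_ps(s6, s7)));
    __m128 v = _mm_add_ps(_mm256_castps256_ps128(s), _mm256_extractf128_ps(s, 1));
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ss(v, _mm_shuffle_ps(v, v, 1));
    return _mm_cvtss_f32(v);
}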
Those numbers also leave out the third major dimension of Intel CPU performance: fused-domain uop throughput. Most instructions decode to a single uop, but some decode to multiple uops. (Especially the SSE4.2 string instructions like the _mm_cmpestrc you mentioned: PCMPESTRI is 8 uops on Skylake). Even if there's no bottleneck on any specific execution port, you can still bottleneck on the frontend's ability to keep the out-of-order core fed with work to do. Intel Sandybridge-family CPUs can issue up to 4 fused-domain uops per clock, and in practice can often come close to that when other bottlenecks don't occur. (See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for some interesting best-case frontend throughput tests for different loop sizes.) Since load/store instructions use different execution ports than ALU instructions, this can be the bottleneck when data is hot in L1 cache.
And unless you look at the compiler-generated asm, you won't know how many extra MOVDQA instructions the compiler had to use to copy data between registers, to work around the fact that without AVX, most instructions replace their first source register with the result. (i.e. destructive destination). You also won't know about loop overhead from any scalar operations in the loop.
I think I have a decent understanding of the difference between latency and throughput
Your guesses don't seem to make sense, so you're definitely missing something.
CPUs are pipelined, and so are the execution units inside them. A "fully pipelined" execution unit can start a new operation every cycle (throughput = one per clock)
(reciprocal) Throughput is how often an operation can start when no data dependencies force it to wait, e.g. one per 7 cycles for this instruction.
Latency is how long it takes for the results of one operation to be ready, and usually matters only when it's part of a loop-carried dependency chain.
If the next iteration of a loop operates independently from the previous, then out-of-order execution can "see" far enough ahead to find the instruction-level parallelism between two iterations and keep itself busy, bottlenecking only on throughput.
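To make that concrete, here's a small sketch (hypothetical arrays a and b, not from the original answer) contrasting a latency-bound loop with a throughput-bound one:

#include <stddef.h>

// Latency-bound: each multiply needs the previous result, so the loop runs at
// one multiply per (multiply latency) cycles, regardless of throughput.
float product(const float *a, size_t n) {
    float prod = 1.0f;
    for (size_t i = 0; i < n; ++i)
        prod *= a[i];               // loop-carried dependency chain
    return prod;
}

// Throughput-bound: iterations are independent, so out-of-order execution can
// overlap many of them; the limit is how many multiplies can start per clock.
void square(const float *a, float *b, size_t n) {
    for (size_t i = 0; i < n; ++i)
        b[i] = a[i] * a[i];         // no dependency between iterations
}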
See also Latency bounds and throughput bounds for processors for operations that must occur in sequence for an example of a practice problem from CS:APP with a diagram of two dep chains, one also depending on results from the other.
Is it possible to query the number of execution units/ports per core and similar information on Intel CPUs?
I have an assembly program and noticed that the performance is quite different on different CPUs. For example, on a Core i5 4570, some functions consistently take 25% more cycles to complete than on a Core i7 4970HQ. They are both Haswell based, from the same generation. No memory movement is involved in the part of the program benchmarked. So I am thinking maybe the difference comes from details such as the number of execution units, the number of ports, etc. The benchmark measures single-core CPU cycles, so frequency / HT etc. should not come into play.
Am I right to assume such an explanation of the performance difference? If yes, where can I find such information for specific CPUs? And is it possible to query it dynamically? If so, I could dispatch dynamically based on that information, distribute uops more evenly, and use similar techniques to optimize the program for multiple CPUs.
Did you time reference cycles (RDTSC) instead of core clock cycles (with perf counters)? That would explain your observations.
Turbo makes a big difference, and the ratio between max turbo and max sustained / rated clock speed (i.e. reference cycle tick rate) is different on different CPUs. e.g. see my answer on this related question
The lower the CPU's TDP, the bigger the ratio between sustained and peak. The Haswell wikipedia article has tables:
84W desktop i5 4570: sustained 3.2GHz = RDTSC frequency, max turbo 3.6GHz (the speed the core was probably actually running for most of your benchmark, if it had time to go up from low-power idle speed).
47W laptop i7-4960HQ: 2.6GHz sustained = RDTSC frequency vs. 3.8GHz max turbo.
Time your code with performance counters, and look at the "core clock cycles" count. (And lots of other neat stuff).
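For example, here's a minimal Linux-only sketch (an illustration, assuming perf_event_open is permitted and using a hypothetical busy-loop as the timed work) that reads the core-clock-cycle counter alongside RDTSC reference cycles; with turbo active, the two counts differ by roughly the turbo ratio:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   // core clock cycles: varies with turbo
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    uint64_t tsc0 = __rdtsc();                // reference cycles: fixed frequency
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                  // hypothetical work to time
    for (long i = 0; i < 100000000; ++i) x += 1.0;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t tsc1 = __rdtsc();

    long long core_cycles = 0;
    read(fd, &core_cycles, sizeof(core_cycles));
    printf("core cycles: %lld  reference (TSC) cycles: %llu\n",
           core_cycles, (unsigned long long)(tsc1 - tsc0));
    return 0;
}

Or simply run the program under perf stat, which reports cycles along with the effective clock frequency.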
Every Haswell core is identical, from 5-watt Core M CPUs to high-power quad-cores to 18-core Xeons (which actually have a per-core power budget more like a laptop CPU); it's only the L3 caches, number of cores (and interconnect), and support or not for HT and/or Turbo that differ. Basically everything outside the cores themselves can be different, including the GPU. They don't disable execution ports, and even the L1/L2 caches are identical. I think disabling execution ports would require significant redesigns in the out-of-order scheduler and stuff like that.
More importantly, every port has at least one execution unit that isn't found on any other port: p0 has the divider, p1 has the integer multiply unit, p5 has the shuffle unit, and p6 is the only port that can execute predicted-taken branches. Actually, p2 and p3 are identical load ports (and can handle store-address uops)...
See Agner Fog's microarch pdf for more about Haswell internals, and also David Kanter's writeup with diagrams of the different blocks.
(However, it's not strictly true that the entire core is identical: Haswell Pentium/Celeron CPUs don't support AVX/AVX2 or BMI/BMI2. I think they do that by disabling decode of VEX prefixes in the decoders. This is still the case for Skylake Pentiums/Celerons, so thanks Intel for delaying the time when we can assume support for new instruction sets. Presumably they do this so CPUs with defects in only the upper or lower half of their vector execution units can still be sold as Celeron or Pentium, just like CPUs with a defect in some of their L3 can be sold as i5 instead of i7.)
I am writing an OpenCL kernel which involves a few barriers in a loop. I have tested the kernel on a CPU (8-core FX-8150), and the results show these barriers reduced running speed by a factor of 50~100 (I further verified this by re-implementing the kernel in Java using multi-threading + CyclicBarrier). I suspect the reason is that a barrier essentially stops the CPU from taking advantage of out-of-order execution, so I am a little worried whether I would observe the same magnitude of slowdown on a GPU. I checked a few official documents and googled around a bit, but there is little information available on this topic.
Current state-of-the-art GPUs are in-order pipelined processors. GPUs fill the pipeline effectively by interleaving instructions from different warps (wavefronts). In comparison, CPUs use out-of-order speculative execution to fill the pipeline. There are different functional units, like ALUs and SFUs, which have separate pipelines. But notice that an instruction dependency stalls the warp. For more information on instruction-dependency resolution on GPUs, refer to this NVIDIA patent.
According to "NVIDIA's Next Generation CUDA Compute and Graphics Architecture, Code-Named 'Fermi'", the Nvidia GigaThread Engine has the following capabilities (page 5):
10x faster application context switching
Concurrent kernel execution
Out of Order thread block execution :)
Dual overlapped memory transfer engines
AMD's Evergreen architecture has SIMD capabilities and has a chance to outperform some Fermi parts, but I don't know about its out-of-order execution. The HD 7000 series also has a "local atomic add" advantage over the GTX 600 series (nearly 10x faster).