How many ALUs are in a CPU?

I believe there is 1 ALU per core in a CPU, correct? I seem to be having a little bit of difficulty looking this up. Someone asked me in a discussion for school, so I am quite curious as well.

A modern superscalar (http://en.wikipedia.org/wiki/Superscalar) CPU has many execution pipelines, and there can be an ALU in several of them. For example, Intel's Sandy Bridge / Ivy Bridge ("*Bridge") microarchitectures have 6 execution ports, and some ports have 2 or 3 execution pipelines behind them; Haswell has 8 ports. Check http://www.anandtech.com/show/6355/intels-haswell-architecture/8 - it has pictures of the pipelines for Nehalem, Sandy Bridge and Haswell with some of the ALUs marked (I count 6 ALUs for Haswell, and there are many smaller ALUs in that sea of a billion transistors).
SIMD ALUs are also common these days (SSE2, AVX, ...); a SIMD unit contains several ALUs that work on the elements of a short vector.

No, the number varies a lot across CPUs. Look up Superscalar on Wikipedia to get a feel for it, then pick up some reference text to get started. Also read up on Hyper-Threading if you get around to it.

Clock Cycles for the invlpg instruction

I was reading some documentation about the invlpg instruction for Intel Pentium processors and it says that it takes 25 clock cycles. I thought that this depended on the implementation (the particular CPU) and not the actual instruction set architecture? Or is the fact that this instruction must take 25 clock cycles to run also part of the instruction set specification?
The documentation is saying that it took 25 clock cycles on the Pentium. The number of clock cycles the instruction takes on other CPUs may be more or fewer. The performance of instructions is not part of the instruction set specification.
That number is not part of any official ISA documentation, it's just performance data that someone annotated into an old (then-current) copy of Intel's ISA docs.
It's from some particular microarchitecture, presumably the P5 Pentium, which was current back when Tripod was a widely used web host and which that guide labels itself as documenting. (These days there are Pentium/Celeron CPUs that are just cut-down versions of i3/i5/i7 chips of the same generation, with stuff like AVX and BMI1/2 disabled. But Pentium used to refer to the P5 microarchitecture.)
It's not from Intel's documentation; it was added by whoever compiled that HTML. The formatting is similar to modern versions of Intel's vol.2 x86 SDM instruction-set reference manual. You can find HTML extracts of that at https://github.com/HJLebbink/asm-dude/wiki/INVLPG and https://www.felixcloutier.com/x86/invlpg for example. The encoding / mnemonic / description table at the top has identical formatting in your Tripod link, but the actual description text for invlpg is somewhat different. (The text for inc, on the other hand, is word-for-word identical between current Intel docs and the Tripod page.)
So yes, this is based on an old PDF->HTML conversion of Intel's vol.2 manual, with P5 cycle counts and instruction-pairing info added (inc pairs in the U or V pipe on that dual-issue in-order pipeline, which doesn't break instructions down into uops), and with the FLAGS-updating section turned into tables.
That instruction-pairing and cycle-count info is totally irrelevant when tuning for modern microarchitectures like Skylake and Zen, but you can find it in Agner Fog's instruction tables: his spreadsheet has a sheet for P5, as well as for later Intel, AMD, and Via microarchitectures. (Also see his optimization guide and microarch pdf for background info to help you make sense of uops / ports / latency / throughput info.) Agner doesn't test most kernel instructions so invlpg isn't in his list.
http://faydoc.tripod.com/cpu/index.htm is obviously not an official Intel source. I don't know where the author got their info from; maybe they tested it themselves. Intel has also sometimes published timing numbers for some microarchitectures, e.g. as part of its optimization manual. That is totally separate from the x86 ISA manuals and is not something you can rely on for correctness. And other people have published their own test results.
Another good source for experimental test results of instruction performance (uops for which ports, latency, and throughput) is https://uops.info/. Their testing for invlpg m8 shows it has a back-to-back throughput of ~194 cycles in practice on Skylake-client, ~157 on Nehalem, and ~126.25 on Zen+ and Zen2, to pick some random examples. But it may interleave better with other instructions, taking "only" 47 front-end uops on recent Intel CPUs and thus can issue in under 12 cycles if the back-end has room in the ROB / RS, maybe letting later instructions execute while the invlpg operation is in progress. (Although if it takes over 100 cycles for its uops to retire, that will often stall OoO exec at some point for a fraction of the total time.)
Remember that instruction performance can't be characterized by a single number on out-of-order CPUs; it's not one-dimensional. Perf analysis is not as simple as adding up a cycle cost for every instruction in a loop; you have to analyze how they can overlap with each other. Or for complex cases like invlpg, measure.

Does SIMD require a multi-core CPU?

Is a multi-core CPU required to implement SIMD?
I found the following phrase "multiple processing elements" when reading Wikipedia about SIMD. So what's the difference between this phrase and "multi-core CPU"?
Every core has its own independent SIMD execution units. Using SIMD instructions in one core doesn't cost execution resources in other cores. Separate cores, even on the same physical chip, are independent: they can go to sleep separately to save power, and there are various other design reasons for keeping them isolated.
One exception that I'm aware of: AMD Bulldozer has two weak integer cores sharing a SIMD / FPU and sharing some cache. They call this a "cluster", and it's basically an alternative to Hyperthreading (SMT). See David Kanter's Bulldozer write-up on RealworldTech.
SIMD and multi-core are orthogonal: you can have multi-core without SIMD (maybe some ARM chips without an FPU / NEON), and you can have SIMD without multi-core.
There are many examples of the latter, most prominently early x86 chips like Pentium-MMX through Pentium III / Pentium 4, which had MMX / SSE1 / SSE2 but were single-core CPUs.
There are at least three different kinds of parallelism in programs:
Instruction-level parallelism: it's possible to overlap some of the work done by different instructions within the same single thread of execution, preserving the illusion of running every instruction one after another. Exploit it by building a pipelined CPU core, or superscalar (multiple instructions per clock), or even out-of-order execution. (See my answer on a question about that for details.)
When creating software: Expose this parallelism to the hardware by avoiding long dependency chains whenever possible. (e.g. replace sum += a[i++] with sum1+=a[i]; sum2+=a[i+1]; i+=2;: unroll with multiple accumulators). Or use arrays instead of linked lists, because the next address to load is computed cheaply, instead of being part of the data from memory you have to wait for on a cache miss. But mostly ILP is already there in "normal" code without doing anything special, and you build bigger / fancier hardware to find more of it, and increase the average instructions-per-clock.
Data parallelism: you need to do the same thing to every pixel of an image, or every sample in an audio file. (e.g. blend 2 images, or mix two audio streams). Exploit this by building parallel execution units into each CPU core so a single instruction can do 16 single-byte additions in parallel, giving you increased throughput with no increase in the amount of instructions you need to get through the CPU core per clock. This is SIMD: Single Instruction, Multiple Data.
Audio / video are the most well-known applications of this, where the speedups are massive because you can fit a lot of byte or 16-bit elements into a single fixed-width vector register.
Exploit SIMD by auto-vectorizing loops with smart compilers, or manually. SIMD turns sum += a[i]; into sum[0..3] += a[i+0..3] (for 4 elements per vector, e.g. 32-bit int or float elements in a 128-bit vector). A minimal intrinsics sketch of this transformation follows the list below.
Thread/task-level parallelism: exploit with multi-core CPUs, expose to the hardware by manually writing multi-threaded code, or using OpenMP or other auto-parallelization tools to multi-thread a loop, or use a library function that starts multiple threads for a big matrix multiply or something.
Or more simply by running multiple separate programs at once. e.g. compile with make -j8 to keep 8 compile processes in flight at once. Coarse-grained task-level parallelism can also be exploited by running your workload on a cluster of multiple computers, or even distributed computing.
But multi-core CPUs make it possible / efficient to exploit fine-grained thread-level parallelism where tasks need to share lots of data (like a large array), or have low latency communication through shared memory. (e.g. with locks to protect different parts of shared data, or lockless programming.)
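Picking up the SIMD point from above, here is a minimal sketch (my own illustration, not from the original answer; the helper name sum_sse is made up) of turning sum += a[i] into 4-wide vector adds with SSE intrinsics:

```cpp
// Minimal SSE sketch of sum[0..3] += a[i+0..3]; assumes x86 with SSE support.
#include <immintrin.h>
#include <cstddef>

float sum_sse(const float* a, std::size_t n)       // hypothetical helper name
{
    __m128 vsum = _mm_setzero_ps();                 // four packed-float accumulators in one register
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_loadu_ps(a + i));   // one instruction adds 4 elements

    // Horizontal sum of the 4 lanes, then handle leftover elements scalar.
    float tmp[4];
    _mm_storeu_ps(tmp, vsum);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < n; ++i)
        sum += a[i];
    return sum;
}
```

Note that this sketch still uses a single vector accumulator, so it is latency-bound; the multiple-accumulator (ILP) angle is covered in the combined example below.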
These three kinds of parallelism are orthogonal.
To sum a very large array of float on a modern CPU:
You'd start one thread per CPU core, and have each core loop over a chunk of the array in shared memory. (Thread-level parallelism). This gives you a factor of 4 speedup, let's say. (Even that's maybe unrealistic because of memory bottlenecks, but you can imagine some other computationally intensive task that didn't require reading so much memory, running on a 28-core Xeon, or a dual-socket server with two of those chips...)
The code for each thread would use SIMD to do 4 or 8 adds per instruction, on each core separately. (SIMD). This gives you a factor of 4 or 8 speedup. (Or 16 with AVX512)
You'd unroll with, let's say, 8 vector accumulators to hide the latency of floating-point add. (ILP). Skylake's vaddps instruction has a latency of 4 cycles and a throughput of one per 0.5 cycles (i.e. 2 per clock). So 8 accumulators is just barely enough to hide that latency and keep 8 FP add instructions in flight at once.
The total throughput gain over single-threaded scalar sum += a[i++] is the product of all those speedup factors: 4 * 8 * 8 = 256x the throughput of a non-parallelized, non-vectorized, single-accumulator ILP-bottlenecked naive implementation like you'd get from gcc -O2 for a simple loop. clang -O3 -march=native -ffast-math would give SIMD, and some ILP (because clang knows how to use multiple accumulators when unrolling, often using 4, unlike gcc.)
You'd need OpenMP or other auto-parallelization to exploit multiple cores.
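As a hedged sketch of what that combination might look like in source, here is a reduction that leans on OpenMP for the threading and on the compiler (something like clang -O3 -march=native -ffast-math -fopenmp) for the SIMD and multi-accumulator unrolling. How well it actually unrolls is compiler-dependent, so treat this as an illustration rather than a guaranteed 256x speedup:

```cpp
// Sum a large float array using all three kinds of parallelism at once:
// threads come from OpenMP; SIMD + multiple accumulators come from the
// compiler under -O3 -march=native -ffast-math (reassociation is required
// for it to vectorize/unroll an FP reduction).
#include <cstddef>

float sum_array(const float* a, std::size_t n)
{
    float sum = 0.0f;
    // Each thread reduces its own chunk into a private partial sum
    // (thread-level parallelism); the reduction clause combines them at the end.
    #pragma omp parallel for reduction(+:sum)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
        sum += a[i];    // the compiler can turn this into vector adds
                        // with several accumulators per thread
    return sum;
}
```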
Related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more in-depth look at multiple accumulators for ILP, and SIMD, for an FMA loop.
No. Each core can normally perform most general operations from the instruction set, but the "multiple processing elements" for SIMD operations just perform the same operation on different pieces of data (different bytes or words).
For example, each core of the ARM Cortex-A53 microarchitecture is capable of running SIMD instructions independently of the other cores, while SIMD instruction sets such as MMX, SSE and SSE2 were first introduced on single-core CPUs.
Yes, it does, but only from a marketing point of view: it would be difficult to sell a microprocessor or microcontroller with no SIMD instructions.

latency vs throughput in intel intrinsics

I think I have a decent understanding of the difference between latency and throughput, in general. However, the implications of latency on instruction throughput are unclear to me for Intel Intrinsics, particularly when using multiple intrinsic calls sequentially (or nearly sequentially).
For example, let's consider:
_mm_cmpestrc
This has a latency of 11 and a throughput of 7 on a Haswell processor. If I ran this instruction in a loop, would I get continuous per-cycle output after 11 cycles? Since this would require 11 instructions to be running at a time, and since I have a throughput of 7, do I run out of "execution units"?
I am not sure how to use latency and throughput other than to get an impression of how long a single instruction will take relative to a different version of the code.
For a much more complete picture of CPU performance, see Agner Fog's microarchitecture guide and instruction tables. (Also his Optimizing C++ and Optimizing Assembly guides are excellent). See also other links in the x86 tag wiki, especially Intel's optimization manual.
See also
https://uops.info/ for accurate tables collected programmatically from microbenchmarks, so they're free from the editing errors that Agner's tables sometimes have.
How many CPU cycles are needed for each assembly instruction?
and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for more details about using instruction-cost numbers.
What is the efficient way to count set bits at a position or lower? for an example of analyzing short sequences of asm in terms of front-end uops, back-end ports, and latency.
Modern Microprocessors: A 90-Minute Guide! is a very good intro to the basics of CPU pipelines and HW design constraints like power.
Latency and throughput for a single instruction are not actually enough to get a useful picture for a loop that uses a mix of vector instructions. Those numbers don't tell you which intrinsics (asm instructions) compete with each other for throughput resources (i.e. whether they need the same execution port or not). They're only sufficient for super-simple loops that e.g. load / do one thing / store, or e.g. sum an array with _mm_add_ps or _mm_add_epi32.
You can use multiple accumulators to get more instruction-level parallelism, but you're still only using one intrinsic so you do have enough information to see that e.g. CPUs before Skylake can only sustain a throughput of one _mm_add_ps per clock, while SKL can start two per clock cycle (reciprocal throughput of one per 0.5c). It can run ADDPS on both its fully-pipelined FMA execution units, instead of having a single dedicated FP-add unit, hence the better throughput but worse latency than Haswell (3c lat, one per 1c tput).
Since _mm_add_ps has a latency of 4 cycles and a throughput of two per clock on Skylake, that means 8 vector-FP add operations can be in flight at once. So you need 8 independent vector accumulators (which you add to each other at the end) to expose that much parallelism. (e.g. manually unroll your loop with 8 separate __m256 sum0, sum1, ... variables. Compiler-driven unrolling (compile with -funroll-loops -ffast-math) will often use the same register, but loop overhead wasn't the problem).
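A minimal sketch of that manual unrolling (my own illustration, assuming AVX is available; the helper name sum8_avx is made up). The 8 independent accumulators are exactly enough to cover 4-cycle latency at 2 adds per clock:

```cpp
// Unrolled AVX sum with 8 independent vector accumulators, so 8 vaddps
// dependency chains are in flight at once (ILP) on top of 8 floats per vector (SIMD).
// Compile with -mavx (or -march=native) on a CPU that supports AVX.
#include <immintrin.h>
#include <cstddef>

float sum8_avx(const float* a, std::size_t n)   // hypothetical helper name
{
    __m256 acc[8];
    for (int k = 0; k < 8; ++k) acc[k] = _mm256_setzero_ps();

    std::size_t i = 0;
    for (; i + 64 <= n; i += 64)                // 8 accumulators * 8 floats each
        for (int k = 0; k < 8; ++k)             // fixed trip count: compilers keep acc[] in registers
            acc[k] = _mm256_add_ps(acc[k], _mm256_loadu_ps(a + i + 8 * k));

    // Combine the 8 accumulators into one vector, then reduce to a scalar.
    for (int k = 1; k < 8; ++k) acc[0] = _mm256_add_ps(acc[0], acc[k]);
    float tmp[8];
    _mm256_storeu_ps(tmp, acc[0]);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) sum += tmp[k];
    for (; i < n; ++i) sum += a[i];             // leftover elements
    return sum;
}
```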
Those numbers also leave out the third major dimension of Intel CPU performance: fused-domain uop throughput. Most instructions decode to a single uop, but some decode to multiple uops. (Especially the SSE4.2 string instructions like the _mm_cmpestrc you mentioned: PCMPESTRI is 8 uops on Skylake). Even if there's no bottleneck on any specific execution port, you can still bottleneck on the frontend's ability to keep the out-of-order core fed with work to do. Intel Sandybridge-family CPUs can issue up to 4 fused-domain uops per clock, and in practice can often come close to that when other bottlenecks don't occur. (See Is performance reduced when executing loops whose uop count is not a multiple of processor width? for some interesting best-case frontend throughput tests for different loop sizes.) Since load/store instructions use different execution ports than ALU instructions, this can be the bottleneck when data is hot in L1 cache.
And unless you look at the compiler-generated asm, you won't know how many extra MOVDQA instructions the compiler had to use to copy data between registers, to work around the fact that without AVX, most instructions replace their first source register with the result. (i.e. destructive destination). You also won't know about loop overhead from any scalar operations in the loop.
"I think I have a decent understanding of the difference between latency and throughput"
Your guesses don't seem to make sense, so you're definitely missing something.
CPUs are pipelined, and so are the execution units inside them. A "fully pipelined" execution unit can start a new operation every cycle (throughput = one per clock)
(reciprocal) Throughput is how often an operation can start when no data dependencies force it to wait, e.g. one per 7 cycles for this instruction.
Latency is how long it takes for the results of one operation to be ready, and usually matters only when it's part of a loop-carried dependency chain.
If the next iteration of a loop operates independently from the previous, then out-of-order execution can "see" far enough ahead to find the instruction-level parallelism between two iterations and keep itself busy, bottlenecking only on throughput.
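To make the distinction concrete, here is a rough, hedged microbenchmark idea of my own: one loop where every FP add depends on the previous one (latency-bound), and one with several independent chains (closer to throughput-bound). Exact ratios depend on the CPU and on what the compiler does with the independent version (it may also vectorize it), so take the output as qualitative:

```cpp
// Latency-bound vs. throughput-bound: one dependency chain vs. 4 independent chains.
// Build without -ffast-math so the compiler can't reassociate the dependent chain.
#include <chrono>
#include <cstdio>

int main()
{
    const long N = 200'000'000;                 // same total number of adds in both loops
    using clk = std::chrono::steady_clock;

    volatile float seed = 1.0f;                 // volatile so the compiler can't constant-fold
    float x = seed;
    auto t0 = clk::now();
    for (long i = 0; i < N; ++i) x += 1.0f;     // each add waits for the previous result (latency)
    auto t1 = clk::now();

    float a = seed, b = seed, c = seed, d = seed;
    auto t2 = clk::now();
    for (long i = 0; i < N; i += 4) { a += 1.0f; b += 1.0f; c += 1.0f; d += 1.0f; }  // independent chains overlap
    auto t3 = clk::now();

    std::chrono::duration<double> dep = t1 - t0, indep = t3 - t2;
    std::printf("dependent:   %.3f s  (result %g)\n", dep.count(), x);
    std::printf("independent: %.3f s  (result %g)\n", indep.count(), a + b + c + d);
}
```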
See also Latency bounds and throughput bounds for processors for operations that must occur in sequence for an example of a practice problem from CS:APP with a diagram of two dep chains, one also depending on results from the other.

Query Intel CPU details of execution unit, port, etc

Is it possible to query the number of execution unit/port per core and similar information on Intel CPU?
I have an assembly program, and noticed that the performance is quite different on different CPUs. For example, on a Core i5 4570, some functions consistently take about 25% more cycles to complete than on a Core i7 4970HQ. They are both Haswell based, from the same generation. No memory movement is involved in the part of the program being benchmarked. So I am thinking maybe the difference comes from details such as the number of execution units, the number of ports, etc. The benchmark measures single-core CPU cycles, so frequency / HT etc. does not come into play.
Am I right to assume that this explains the performance difference? If yes, where can I find such information for specific CPUs? And is it possible to query it dynamically? If so, I could dispatch dynamically based on that information, distribute uops more evenly, and use similar techniques to optimize the program for multiple CPUs.
Did you time reference cycles (RDTSC) instead of core clock cycles (with perf counters)? That would explain your observations.
Turbo makes a big difference, and the ratio between max turbo and max sustained / rated clock speed (i.e. reference cycle tick rate) is different on different CPUs. e.g. see my answer on this related question
The lower the CPU's TDP, the bigger the ratio between sustained and peak. The Haswell wikipedia article has tables:
84W desktop i5 4570: sustained 3.2GHz = RDTSC frequency, max turbo 3.6GHz (the speed the core was probably actually running for most of your benchmark, if it had time to go up from low-power idle speed).
47W laptop i7-4960HQ: 2.6GHz sustained = RDTSC frequency vs. 3.8GHz max turbo.
Time your code with performance counters, and look at the "core clock cycles" count. (And lots of other neat stuff).
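For reference, here is a small sketch (assuming GCC/Clang on x86, with a made-up workload function) showing why RDTSC-based timing can mislead: __rdtsc() counts reference cycles at the TSC's fixed rate, which is not the same as the core clock when turbo or power saving changes the frequency.

```cpp
// Rough RDTSC timing sketch. The TSC ticks at a constant "reference" rate
// (roughly the rated/sticker frequency), not at the actual core clock, so
// "cycles" measured this way are not core clock cycles when turbo is active.
#include <x86intrin.h>   // __rdtsc() on GCC/Clang
#include <chrono>
#include <cstdio>

static void workload()                    // placeholder for the code being benchmarked
{
    volatile double x = 1.0;
    for (int i = 0; i < 10'000'000; ++i) x = x * 1.000000001 + 1e-12;
}

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    unsigned long long c0 = __rdtsc();
    workload();
    unsigned long long c1 = __rdtsc();
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("reference cycles: %llu\n", c1 - c0);
    std::printf("approx TSC frequency: %.2f GHz\n", (c1 - c0) / secs / 1e9);
    // Compare this with the frequency the core actually ran at (e.g. from
    // 'perf stat', which reports real core clock cycles) to see the turbo gap.
}
```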
Every Haswell core is identical from Core-M 5Watt CPUs to high-power quad core to 18-core Xeon (which actually has a per-core power-budget more like a laptop CPU); it's only the L3 caches, number of cores (and interconnect), and support or not for HT and/or Turbo that differ. Basically everything outside the cores themselves can be different, including the GPU. They don't disable execution ports, and even the L1/L2 caches are identical. I think disabling execution ports would require significant redesigns in the out-of-order scheduler and stuff like that.
More importantly, every port has at least one execution unit that isn't found on any other port: p0 has the divider, p1 has the integer multiply unit, p5 has the shuffle unit, and p6 is the only port that can execute predicted-taken branches. Actually, p2 and p3 are identical load ports (and can handle store-address uops)...
See Agner Fog's microarch pdf for more about Haswell internals, and also David Kanter's writeup with diagrams of the different blocks.
(However, it's not strictly true that the entire core is identical: Haswell Pentium/Celeron CPUs don't support AVX/AVX2, or BMI/BMI2. I think they do that by disabling decode of VEX prefixes in the decoders. This is still the case for Skylake Pentiums/Celerons, so thanks Intel for delaying the time when we can assume support for new instruction sets. Presumably they do this so CPUs with defects in only the upper or lower half of their vector execution units can still be sold as Celeron or Pentium, just like CPUs with a defect in some of their L3 can be sold as i5 instead of i7.)
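You can't query the number of execution ports at runtime, but you can query instruction-set support and dispatch on that, which is the part that actually varies. A minimal sketch using GCC/Clang's __builtin_cpu_supports (the kernel functions here are made-up placeholders):

```cpp
// Runtime dispatch on instruction-set support via CPUID (GCC/Clang built-in).
// Execution-port counts can't be queried, and don't differ between CPUs of the
// same microarchitecture anyway, but ISA extensions like AVX2 can and do.
#include <cstdio>

// Hypothetical kernels: real ones would be compiled with -mavx2 or target attributes.
static void kernel_avx2() { std::puts("using the AVX2 code path"); }
static void kernel_sse2() { std::puts("using the SSE2 baseline code path"); }

int main()
{
    if (__builtin_cpu_supports("avx2"))   // reads CPUID feature bits at runtime
        kernel_avx2();
    else
        kernel_sse2();
}
```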

What is difference between 'Cores across processors' and 'Number of CPUs'?

E.g. consider the following processor configuration of my machine:
Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz (4 CPUs)
Then how should I find out how many 'cores across processors' my machine has?
Is it 4 cores [i.e. the number of CPUs]?
I have referred to the following links but still do not get a clear idea:
http://www.ehow.com/how_6873203_do-number-core-processors-windows_.html
Can anyone please clear my doubt?
'Cores across processors' means nothing, or at least nothing in particular; it's a generic, non-technical phrase with no exact meaning, or no meaning at all.
According to Intel, this CPU provides 2 physical cores with Hyper-Threading, which means that you get 4 logical cores, or so-called threads.
Hyper-Threading is an Intel technology that provides 2 threads per core, so 2*2 = 4 threads.
I think that this is the closest answer to what you are asking here.
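If the practical question is just "how many logical CPUs does my machine expose?", a portable sketch is std::thread::hardware_concurrency(), which on this machine would presumably report 4 (2 cores x 2 Hyper-Threads). Note that it returns logical processors, not physical cores, and may return 0 if the value is unknown:

```cpp
// Report the number of logical processors (hardware threads) the OS exposes.
// On a 2-core CPU with Hyper-Threading this is typically 4.
#include <thread>
#include <cstdio>

int main()
{
    unsigned n = std::thread::hardware_concurrency();  // 0 means "unknown"
    std::printf("logical processors: %u\n", n);
    // Physical core count needs an OS- or CPUID-specific query; this number
    // counts Hyper-Threading logical CPUs, not cores.
}
```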
Let's first clarify what a CPU is and what a core is. A central processing unit (CPU) can have multiple core units; each of those cores is a processor by itself, capable of executing a program, but self-contained on the same chip.
In the past one CPU was distributed among quite a few chips, but as Moore's Law progressed they managed to fit a complete CPU inside one chip (die). Since the 90s manufacturers started to fit more cores onto the same die, and that's the concept of multi-core.
These days it is possible to have tens or even hundreds of cores on the same chip or die (GPUs, Intel Xeon). Another technique developed in the 90s was simultaneous multi-threading: basically, it was found to be possible to run another thread on the same single-core CPU, since most of the resources, like ALUs and multiple registers, were already duplicated.
So basically a CPU can have multiple cores, each of them capable of running one or more threads at the same time. We may expect to have more cores in the future, but it will be harder to program them efficiently.

Resources