Cache bandwidth per tick for modern CPUs - performance

What is a speed of cache accessing for modern CPUs? How many bytes can be read or written from memory every processor clock tick by Intel P4, Core2, Corei7, AMD?
Please, answer with both theoretical (width of ld/sd unit with its throughput in uOPs/tick) and practical numbers (even memcpy speed tests, or STREAM benchmark), if any.
PS it is question, related to maximal rate of load/store instructions in assembler. There can be theoretical rate of loading (all Instructions Per Tick are widest loads), but processor can give only part of such, a practical limit of loading.

For nehalem: rolfed.com/nehalem/nehalemPaper.pdf
Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache.
128 bit = 16 bytes / clock read
AND
128 bit = 16 bytes / clock write
(can I combine read and write in single cycle?)
The L2 and L3 caches each have a 256-bit port for reading or writing,
but the L3 cache must share its port with three other cores on the chip.
Can L2 and L3 read and write ports be used in single clock?
Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
Latency (clock ticks), some measured by CPU-Z's latencytool or by lmbench's lat_mem_rd - both uses long linked list walk to correctly measure modern out-of-order cores like Intel Core i7
L1 L2 L3, cycles; mem link
Core 2 3 15 -- 66 ns http://www.anandtech.com/show/2542/5
Core i7-xxx 4 11 39 40c+67ns http://www.anandtech.com/show/2542/5
Itanium 1 5-6 12-17 130-1000 (cycles)
Itanium2 2 6-10 20 35c+160ns http://www.7-cpu.com/cpu/Itanium2.html
AMD K8 12 40-70c +64ns http://www.anandtech.com/show/2139/3
Intel P4 2 19 43 200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3 20 180 (cycles) --//--
AthlonFX-51 3 13 125 (cycles) --//--
POWER4 4 12-20 ?? hundreds cycles --//--
Haswell 4 11-12 36 36c+57ns http://www.realworldtech.com/haswell-cpu/5/
And good source on latency data is 7cpu web-site, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html
More about lat_mem_rd program is in its man page or here on SO.

Widest read/writes are 128 bit (16 byte) SSE load/store. L1/L2/L3 caches have different bandwidths and latencies and these are of course CPU-specific. Typical L1 latency is 2 - 4 clocks on modern CPUs but you can usually issue 1 or 2 load instructions per clock.
I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve ? Do you just want to write the fastest possible memcpy ?

Related

Memory access time in deep cache hierarchy

I am trying to solve an exercise question from Computer architecture textbook. The book includes the equation for calculating memory access time (MAT) for up to L2 cache (eq. below), however the exercise has upto L4 cache and off chip memory access for which I don't understand how to use the equation to calculate Avg MAT.
So, Average memory access time = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2xMiss penalty_L2)
In exercise question it mentioned a cache hierarchy as -- >[32 KB L1; 128 KB L2; 2 MB L3; 8 MB L4; off-chip memory] for which the memory access time needs to be calculated.
Given, cache/latency/Miss per thousand instructions values: 32 KB/1/100, 128 KB/2/80, 512 KB/4/50, 2 MB/8/40, 8 MB/16/10. And off chip memory access requires 200 cycles on average. Also, 1000 instructions of a program, an average of 20 memory accesses may exhibit low enough locality and can't be serviced bty 2MB cache which has 20 Miss per thousand instruction.
Could anyone help me to solve the problem?

How to calculate speedup and efficiency in a hybrid CPU and GPU algorithm?

I have an algorithm that I have executed in parallel using only CPU and I have achieved a speedup of 30x. That is, an efficiency equal to 0.93 (efficiency = speedup/cores, i.e. 0.93 = 30/32).
Later I added 2 GPUs (Tesla C2075 of 448 cores each) together to the 32 CPU cores.
To calculate the efficiency including CPUs and GPUs, should I add the amount of GPU cores to the CPU cores? That is, I would calculate the efficiency using 928 cores (32 + 448 + 448 = 928). Or should it be calculated differently?
Speedup and efficiency has been calculated based on what has been said here:
https://software.intel.com/en-us/articles/predicting-and-measuring-parallel-performance
GPUs have bigger "core complex" architectures called "SM" or "CU" with tens of pipelines each. Not "very" similar to "SIMD" of a CPU, they can issue commands in parallel to these pipelines in a "single-threaded" kernel code.
You have counted "cores" in CPU and not SIMD pipelines (which is 4 to 16 times of number of cores) so, it wouldn't be wrong to count SM units of Nvidia or CU of Amd or Slice subset of Intel etc.
Tesla C2075 has 14 SM units so you could add 14 for each GPU (32+14+14).
If you have also used SIMDified code for CPU, then it wouldn't be wrong to count each pipeline of a GPU which is 32 to 192 times the number of SM/CU(like 448 per GPU of yours) (32*SIMD_WIDTH + 448 + 448).
At least this is how I would compute "core efficiency" and "pipeline efficiency". If data transfer to/from GPU is not a bottleneck, efficiency should not drop much after GPUs are added.

Ram real time latency

I read somewhere that to find the real latency of the ram you can use the following rule :
1/((RAMspeed/2)/1000) x CL = True Latency in nanoseconds
ie for DDR1 with 400Mhz clock speed, is it logical to me to divide by 2 to get the FSB speed or the real bus speed which is 200Mhz in this case. So the rule above seems to be correct for the DDR1.
From the other side, DDR2 also doubles the freq of the bus in relation to the previous DDR1 generation (ie 4 bits per clock cycle) according the article "What Every Programmer Should Know About Memory".
So, in the case of a DDR2 with a 800Mhz clock speed, to find the "True Latency" the above rule should be accordingly changed to
1/((RAMspeed/4)/1000) x CL = True Latency in nanoseconds
Is that correct? Because in all the cases I read that the correct way is to take RAMspeed/2 no matter if it's DDR, DDR2, DDR3 or DDR4.
Which is the correct way to get the true latency?
The CAS latency is in memory-bus clock cycles. This is always one half the transfers-per-second number. e.g. DDR3-1600 has a memory clock of 800MHz, doing 1600M transfers per second (during a burst transfer).
DDR2, DDR3, and DDR4 still use a double-pumped 64-bit memory bus (transferring data on the rising and falling edges of the clock signal), not quad-pumped. This is why they're still called Double Data-Rate (DDR) SDRAM.
The FSB speed has nothing to do with it.
On old CPUs without integrated memory controllers, i.e. systems that actually have an FSB, its frequency is often configurable (in the BIOS) separately from the memory speed. See Front Side Bus and RAM speed; on even older systems, the FSB and memory clocks were synchronous.
Normally systems were designed with a fast enough FSB to keep up with the memory controller. Running the FSB at the same clock speed as the memory can reduce latency by avoiding buffering between clock domains.
So yes, the CAS latency in seconds is cycle_count / frequency, or more like your formula
1000ns/us * CL / RAMspeed * 2 transfers/clock, where RAMspeed is in mega-transfers per second.
Higher CL numbers at a higher memory frequency often work out to a similar absolute latency (in seconds). In other words, modern RAM has higher CAS latency timing numbers because more clock cycles happen in the same amount of time.
Bandwidth has vastly improved, while latency has stayed nearly constant, according to these graphs from Crucial which explain CL vs. frequency.
Of course this is not "the memory latency", or the "true" memory latency.
It's the CAS latency of the DRAM itself, and is the most important factor in latency between the memory controller and the DRAM, but is only a part of the latency between a CPU core and memory. There is non-negligible latency inside the CPU between the core and uncore (L3 and memory controller). Uncore is Intel terminology; IDK what AMD calls the parts of the memory hierarchy in their various microarchitectures.
Especially many-core Xeon CPUs have significant latency to L3 / memory controller, because of the large ring bus(es) connecting all the cores. A many-core Xeon has worse L3 and memory latency than a similar dual or quad-core with the same memory and CPU clock frequencies.
This extra latency actually limits single-thread / single-core bandwidth on a big Xeon to worse than a laptop CPU, because a single core can't keep enough requests in flight to fill the memory pipeline with that much latency. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?.
Ok I found the answer.
Everytime the manufacturers increased the memory clock speed they did it in a constant rate which always was the double (2x) of the FSB clock speed. ie
MEM CLK FSB
-------------------
DDR200 100 MHz
DDR266 133 MHz
DDR333 166 MHz
DDR400 200 MHz
DDR2-400 200 MHz
DDR2-533 266 MHz
DDR2-667 333 MHz
DDR2-800 400 MHz
DDR2-1066 533 MHz
DDR3-800 400 MHz
DDR3-1066 533 MHz
DDR3-1333 666 MHz
DDR3-1600 800 MHz
So, the memory module has always the double speed of the FSB.

GT200 Single Precision Peak Performance

I was trying to verify the single precision peak performance of a reference GT200 card.
From http://www.realworldtech.com/gt200/9/, we have two facts about GT200 –
The latency of the fastest operation for an SP core is 4 cycles.
The SFU takes 4 cycles too to finish an operation.
Now, each SM has a total of 8 SPs and 2 SFUs, with each SFU having 4 FP multiply units and these SPs and SFUs can work at the same time as they are on two different ports as explained in their SM level diagrams. Each SP can perform MAD operation.
So, we are looking at 8 MAD operations and 8 MUL operations per 4 SP cycles. This gives us 16 + 8 = 24 operations per 4 SP clock cycles as MAD counts as 2 operations. Since 2 SP clock cycle counts as one shader clock, we have 24/2 = 12 operations per shader clock.
For a reference GT200 card, shader clock = 1296 MHz/s.
Thus, the single precision peak performance must be = 1296 MHz/s * 30 SM * 12 operations per shader clock = 466.560 GFLOPS
This is exactly half of the GFLOPS as reported in the specs. So where am I going wrong?
Edit: After Robert’s pointer to the CUDA Programming Guide that says 8MADs/shader clock can be performed in a GT200 SM, I would have to question how latency and throughput relate to each other in this particular SM.
There is a latency of one OP / 4 SP cycles (as pointed out earlier), thus one MAD every 4 SP cycles, right? We have 8 SPs, so it becomes 8 MADs for every 4 SP cycles in an SM.
Since 2 SP cycles form one shader cycle, so we are left with => 8MADs per 2 shader clock cycles
=> 4 MADs per shader clock.
This doesn’t match with the 8MADs/shader clock from the Programming Guide.
So, what am I doing wrong again?
Latency and throughput are not the same thing.
A cc 1.x SM can retire 8 single precision floating point MAD operations on every clock cycle.
This is the correct formula:
1296 MHz(cycle/s) * 30 SM * (8 SP/SM * 2 flop/cycle per SP + 2 SFU/SM * 4 FPU/SFU * 1 flop/cycle per FPU)
= 622080 Mflop/s + 311040 Mflop/s = 933 GFlop/s single precision
From here
EDIT: The 4-cycle latency you're referring to is the latency of a warp (i.e. 32 threads) MAD instruction, as issued to the SM, not the latency of a single MAD operation on a single SP. The FPU in each SP can generate one MAD result per clock, and there are 8 SP's in one SM, so each SM can generate 8 MAD results per clock. Since a warp (32 threads) MAD instruction requires 32 MAD results, it requires 4 total clocks to complete the warp instruction, as issued to the SPs in the SM.
The FPU in the SP can generate one new MAD result per clock. From the standpoint of instruction issue, the fundamental unit is the warp. Therefore a warp MAD instruction requires 4 clocks to complete.
EDIT2: Responding to question below.
Preface: The FPUs in the SFU are not independently schedulable. They only come into play when an instruction is scheduled to the SFUs. There are 4 FPU per SFU, and an SFU instruction requires 16 cycles (since there are 2 SFU/SM) to complete for a warp. If all 4 FPU in both SFUs were fully utilized, that would be 128 (16x4x2) flops produced during the computation of the SFU instruction, in 16 cycles. This is added to the 256 (16x2x8) total flops that could be generated by the "regular" MAD FPUs in the SM during the same time (16 cycles).
Your question seems to be interpreting the observed benchmark result and this statement in the text:
Table III also shows that the throughput for single-precision
floating point multiplication is 11.2 ops/clock, which means
that multiplication can be issued to both the SP and SFU
units. This suggests that each SFU unit is capable of doing
2 multiplications per cycle, twice the throughput of other
(more complex) instructions that map to this unit.
as an indication of either the throughput of the FPUs in the SFU or else the number of FPUs in the SFU. However you are conflating benchmark data with a theoretical number. The SFU has 4 FPU, but this does not mean that all 4 are independently schedulable for arbitrary arithmetic or instruction streams. Seeing all 4 FPU take on a new floating point instruction in a given cycle may require a specific instruction sequence that the authors haven't used.

Cycles/cost for L1 Cache hit vs. Register on x86?

I remember assuming that an L1 cache hit is 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors?
How many cycles does an L1 cache hit take? How does it compare to register access?
Here's a great article on the subject:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1
To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;)
PS:
The specifics will vary, but this link has some good ballpark figures:
Approximate cost to access various caches and main memory?
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles remote
L3 CACHE ~100-300 cycles
Local DRAM ~30 ns (~120 cycles)
Remote DRAM ~100 ns
PPS:
These figures represent much older, slower CPUs, but the ratios basically hold:
http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/2
Level Access Time Typical Size Technology Managed By
----- ----------- ------------ --------- -----------
Registers 1-3 ns ?1 KB Custom CMOS Compiler
Level 1 Cache (on-chip) 2-8 ns 8 KB-128 KB SRAM Hardware
Level 2 Cache (off-chip) 5-12 ns 0.5 MB - 8 MB SRAM Hardware
Main Memory 10-60 ns 64 MB - 1 GB DRAM Operating System
Hard Disk 3M - 10M ns 20 - 100 GB Magnetic Operating System/User
Throughput and latency are different things. You can't just add up cycle costs. For throughput, see Load/stores per cycle for recent CPU architecture generations - 2 loads per clock throughput for most modern microarchitectures. And see How can cache be that fast? for microarchitectural details of load/store execution units, including showing load / store buffers which limit how much memory-level parallelism they can track. The rest of this answer will focus only on latency, which is relevant for workloads that involve pointer-chasing (like linked lists and trees), and how much latency out-of-order exec needs to hide. (L3 Cache misses are usually too long to fully hide.)
Single-cycle cache latency used to be a thing on simple in-order pipelines at lower clock speeds (so each cycle was more nanoseconds), especially with simpler caches (smaller, not as associative, and with a smaller TLB for caches that weren't purely virtually addressed.) e.g. the classic 5-stage RISC pipeline like MIPS I assumes 1 cycle for memory access on a cache hit, with address calculation in EX and memory access in a single MEM pipeline stage, before WB.
Modern high-performance CPUs divide the pipeline up into more stages, allowing each cycle to be shorter. This lets simple instructions like add / or / and run really fast, still 1 cycle latency but at high clock speed.
For more details about cycle-counting and out-of-order execution, see Agner Fog's microarch pdf, and other links in the x86 tag wiki.
Intel Haswell's L1 load-use latency is 4 cycles for pointer-chasing, which is typical of modern x86 CPUs. i.e. how fast mov eax, [eax] can run in a loop, with a pointer that points to itself. (Or for a linked list that hits in cache, easy to microbench with a closed loop). See also Is there a penalty when base+offset is in a different page than the base? That 4-cycle latency special case only applies if the pointer comes directly from another load, otherwise it's 5 cycles.
Load-use latency is 1 cycle higher for SSE/AVX vectors in Intel CPUs.
Store-reload latency is 5 cycles, and is unrelated to cache hit or miss (it's store-forwarding, reading from the store buffer for store data that hasn't yet committed to L1d cache).
As harold commented, register access is 0 cycles. So, for example:
inc eax has 1 cycle latency (just the ALU operation)
add dword [mem], 1 has 6 cycle latency until a load from dword [mem] will be ready. (ALU + store-forwarding). e.g. keeping a loop counter in memory limits a loop to one iteration per 6 cycles.
mov rax, [rsi] has 4 cycle latency from rsi being ready to rax being ready on an L1 hit (L1 load-use latency.)
http://www.7-cpu.com/cpu/Haswell.html has a table of latency per cache (which I'll copy here), and some other experimental numbers, including L2-TLB hit latency (on an L1DTLB miss).
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
L1 Data cache = 32 KB, 64 B/line, 8-WAY.
L1 Instruction cache = 32 KB, 64 B/line, 8-WAY.
L2 cache = 256 KB, 64 B/line, 8-WAY
L3 cache = 8 MB, 64 B/line
L1 Data Cache Latency = 4 cycles for simple access via pointer (mov rax, [rax])
L1 Data Cache Latency = 5 cycles for access with complex address calculation (mov rax, [rsi + rax*8]).
L2 Cache Latency = 12 cycles
L3 Cache Latency = 36 cycles
RAM Latency = 36 cycles + 57 ns
The top-level benchmark page is http://www.7-cpu.com/utils.html, but still doesn't really explain what the different test-sizes mean, but the code is available. The test results include Skylake, which is nearly the same as Haswell in this test.
#paulsm4's answer has a table for a multi-socket Nehalem Xeon, including some remote (other-socket) memory / L3 numbers.
If I remember correctly it's about 1-2 clock cycles but this is an estimate and newer caches may be faster. This is out of a Computer Architecture book I have and this is information for AMD so Intel may be slightly different but I would bound it between 5 and 15 clock cycles which seems like a good estimate to me.
EDIT: Whoops L2 is 10 cycles with TAG access, L1 takes 1 to two cycles, my mistake :\
Actually the cost of the L1 cache hit is almost the same as a cost of register access. It was surprising for me, but this is true, at least for my processor (Athlon 64). Some time ago I written a simple test application to benchmark efficiency of access to the shared data in a multiprocessor system. The application body is a simple memory variable incrementing during the predefined period of time. To make a comapison, I benchmarked non-shared variable at first. And during that activity I captured the result, but then during application disassembling I found that compiler was deceived my expectations and apply unwanted optimisation to my code. It just put variable in the CPU register and increment it iterativetly in the register without memory access. But real surprise was achived after I force compliler to use in-memory variable instead of register variable. On updated application I achived almost the same benchmarking results. Performance degradation was really negligeble (~1-2%) and looks like related to some side effect.
As result:
1) I think you can consider L1 cache as an unmanaged processor registers pool.
2) There is no any sence to apply brutal assambly optimization by forcing compiler store frequently accesing data in processor registers. If they are really frequently accessed, they will live in the L1 cache, and due to this will have same access cost as the processor register.

Resources