Latency of accessing main memory is almost the same order of sending a packet - performance

Looking at Jeff Dean's famous latency guides
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
One thing which looks somewhat uncanny to me is the time taken to read 1MB sequentially from disk is only 10 times faster than sending a round trip packet across the Atlantic. Can anyone give me more intuition why this feels right.

Q : 1MB SEQ-HDD-READ ~ 10x faster than a CA/NL trans-atlantic RTT - why this feels right?
Some "old" values ( with a few cross-QPI/NUMA updates from 2017 ) to start from:
0.5 ns - CPU L1 dCACHE reference
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - CPU own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1 KB with Zippy PROCESS (+GHz,+SIMD,+multicore tricks)
20,000 ns - Send 2 KB over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
Trans-Atlantic Network RTT :
Global optical networks work roughly at a speed of light ( 300.000.000 m/s )
LA(CA)-AMS(NL) packet has to travel not the geodetical "distance", but over a set of continental and trans-atlantic "submarine" cables, the length of which is way longer ( see the map )
These factors do not "improve" - only the transport capacity is growing, with add-on latencies introduced in light-amplifiers, retiming units and other L1-PHY / L2-/L3-networking technologies are kept under control, as small as possible.
So the LA(CA)-AMS(NL) RTT will remain, using this technology, the same ~ 150 ms
Using other technology, LEO-Sat Cubes - as an example - the "distance" will only grow from ~ 9000 km P2P, by a pair of additional GND/LEO segments, plus by a few addition LEO/LEO hops, which introduce "longer" distance, add-on hop/hop re-processing latencies and capacity will not get any close to the current optical transports available, so no magic jump "back to the future" is to be expected ( we still miss the DeLorean ).
The HDD Disk :
HDD-s can have very fast and very short transport-path for moving the data, but the READ-ops have to wait for the physical / mechanical operations of the media-reading heads ( that takes most of the time here, not the actual data-transfer to the host RAM )
HDD-s are rotational devices, the disk has to "align" where to start the read, which costs the first about 10 [ms]
HDD-s devices store data into a static structure of heads( 2+, reading physical signals from the magnetic plates' surfaces ):cylinders( concentric circular zones on the plate, into which a cyl-aligned reading-head gets settled by disk-head micro-controller):sector( angular-sections of the cylinder, each carrying a block of the same sized data ~ 4KB, 8KB, ... )
These factors do not "improve" - all commodity produced drives remain at industry selected angular speeds of about { 5k4 | 7k2 | 10k | 15k | 18k }-spins/min (RPM). This means, that if a well-compacted data-layouts are maintained on such a disk, one continuous head:cylinder aligned reading round the whole cylinder will take:
>>> [ 1E3 / ( RPM / 60. ) for RPM in ( 5400, 7200, 10000, 15000, 18000 ) ]
11.1 ms per CYL # 5k4 RPM disk,
8.3 ms per CYL # 7k2 RPM disk,
6.0 ms per CYL # 10k RPM disk,
4.0 ms per CYL # 15k RPM disk,
3.3 ms per CYL # 18k RPM disk.
Data-density is also limited by the magnetic media properties. Spintronics R&D will bring some more densely stored data, yet the last 30 years have been well inside the limits of the reliable magnetic storage.
More is to be expected from a trick to co-parallel-read from several heads at-once, yet this goes against the design of the embedded microcontrollers, so most of the reading goes but sequentially, from one head after another, into the HDD-controller onboard buffers, best if no cyl-to-cyl heads mechanical re-alignment were to take place ( technically this depends on the prior data-to-disc layout, maintained by the O/S and possible care of disk-optimisers ( originally called disk disk-"compression", which just tried to re-align the known sequences of FAT-described data-blocks, so as to follow the most optimal trajectory of head:cyl:sector transitions, depending most on the actual device's head:head and cyl:cyl latencies ). So even the most optimistic data-layout takes ~ 13..21 [ms] to seek-and-read but one head:cyl-path
Laws of Physics decide

Some numbers from 2020.
Load from L1 is 4 cycles on Intel Coffee Lake and Ryzen (0.8nsec on a 5GHz CPU).
Load from memory is ~215 cycles on Intel Coffee Lake (43nsec on a 5GHz CPU). ~280 cycles on Ryzen.

Related

How to get better performace in ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now in 6.4-15) and we noticed a decay in performance whenever there is some heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). OSDs are hard drives (HDD) WD Gold or better (4~12 Tb). Nodes with 64/128 Gbytes RAM, dual Xeon CPU mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench" getting stable 110 Mb/sec data transfer to each of them with +- 10 Mb/sec spread during normal operations. Apply/Commit Latency is normally below 55 ms with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front network and back network are both 1 Gbps (separated in VLANs), we are trying to move to 10 Gbps but we found some trouble we are still trying to figure out how to solve (unstable OSDs disconnections).
The Pool is defined as "replicated" with 3 copies (2 needed to keep running). Now the total amount of disk space is 305 Tb (72% used), reweight is in use as some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Got some Ceph pointers that you could follow...
get some good NVMEs (one or two per server but if you have 8HDDs per server 1 should be enough) and put those as DB/WALL (make sure they have power protection)
the ceph tell osd.* bench is not that relevant for real world, I suggest to try some FIO tests see here
set OSD osd_memory_target to at 8G or RAM minimum.
in order to save some write on your HDD (data is not replicated X times) create your RBD pool as EC (erasure coded pool) but please do some research on that because there are some tradeoffs. Recovery takes some extra CPU calculations
All and all, hype-converged clusters are good for training, small projects and medium projects with not such a big workload on them... Keep in mind that planning is gold
Just my 2 cents,
B.

Memory access time in deep cache hierarchy

I am trying to solve an exercise question from Computer architecture textbook. The book includes the equation for calculating memory access time (MAT) for up to L2 cache (eq. below), however the exercise has upto L4 cache and off chip memory access for which I don't understand how to use the equation to calculate Avg MAT.
So, Average memory access time = Hit time_L1 + Miss rate_L1 x (Hit time_L2 + Miss rate_L2xMiss penalty_L2)
In exercise question it mentioned a cache hierarchy as -- >[32 KB L1; 128 KB L2; 2 MB L3; 8 MB L4; off-chip memory] for which the memory access time needs to be calculated.
Given, cache/latency/Miss per thousand instructions values: 32 KB/1/100, 128 KB/2/80, 512 KB/4/50, 2 MB/8/40, 8 MB/16/10. And off chip memory access requires 200 cycles on average. Also, 1000 instructions of a program, an average of 20 memory accesses may exhibit low enough locality and can't be serviced bty 2MB cache which has 20 Miss per thousand instruction.
Could anyone help me to solve the problem?

What is reference when it says L1 Cache Reference or Main Memory Reference

So I am trying to learn performance metrics of various components of computer like L1 cache, L2 cache, main memory, ethernet, disk etc as below:
Latency Comparison Numbers
--------------------------
L1 cache **reference** 0.5 ns
Branch mispredict 5 ns
L2 cache **reference** 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory **reference** 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 10,000 ns 10 us
Send 1 KB bytes over 1 Gbps network 10,000 ns 10 us
Read 4 KB randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps 10,000,000 ns 10,000 us 10 ms 40x memory, 10X SSD
Read 1 MB sequentially from disk 30,000,000 ns 30,000 us 30 ms 120x memory, 30X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
I don't think the reference mentioned above is for how much data is read in bits or bytes. But is actually about maybe accessing one address in cache or memory.
Can someone please explain better what is this reference that's happening in 0.5 n/s ?
This table is listing typical numbers for some representative system,
as the actual values for a real example system would hardly be so "smooth numbers" but complicated sums over some non-even multiples of CPU and/or bus clock periods.
We could find such a table in a textbook for educative use.
This one apparently found its way into a general introduction into system designing1
from some conference presentations Google AI's lead person,
Jeff Dean
held in back in 20093,4.
The two presentation PDFs3,4
do not give an explicit definition what exactly was meant by "reference" in those tables. Instead, the tables are presented to point out that the ability for "back-of-the-envelope calculations" is crucial for successful system design.
The term "reference" likely means retrieving a piece of information from the corresponding level of memory if the requested value is maintained there, so that it doesn't have to be reloaded from a slower source:
L1 cache <- L2 cache <- Main memory (RAM) <- Disk (e.g., swap)
The upper-level sources (RAM, disk) can just be seen as a very rough sketch because here you will find lots of sub-levels and variants (type of mass device, internal cache on the disk's chipset, buses/bridges etc. etc.).
The present numbers appear to be a conclusion of experiences at Google's data center.
Therefore, let's assume they are based on some high-performance class hardware which was relevant in 2009 (or earlier).
Today (2020), the numbers should not be taken literally but to demonstrate the orders of magnitude in the context of the corresponding values for other levels of data transfer.
The label "branch mispredict" stands for all cases when a fetch operation from the next level is necessary, because a mispredicted branching decision is the most important reason for cases when such a fetch operation is critical w. r. t. latencies.
In other cases, branch prediction infrastructure is supposed to trigger data fetch operations in time so all latencies beyond the low "reference" value are hidden behind pipeline operations.
1
The URL you gave us in comment discussion
"Latency numbers every programmer should know" in: "The System Design Primer"
references the following sources:
2
Jeff Dean: "Latency Numbers Every Programmer Should Know", 31 May 2012.
"Originally by Peter Norvig ("Teach Yourself Programming in Ten Years") with some updates from Brendan", 1 Jun 2012.
3
Jeff Dean:
"Designs, Lessons and Advice from Building Large Distributed Systems",
13 Oct 2009,
page 24.
4
Jeff Dean:
"Software Engineering Advice from Building Large-Scale Distributed Systems",
17 Mar 2009,
page 13.
Going to the specific question about what is an L1 cache - it helps to understand multi-level caching -- https://en.wikipedia.org/wiki/CPU_cache#MULTILEVEL
While creating any cache there is a trade-off between hit-rate and latency. Larger caches generally have higher hit rate but also longer latency. To achieve the best of both worlds, many architectures implement 2 or more level of cache - an L1 which is small, super-fast backed by L2 which will be looked up in case of miss from L1, L2 being larger but also slower and so on. The metrics posted in your reference are a rough ballpark of an L1 hit, it would appear.

Calculating Effective CPI when using write-through/write-back architecture

So I'm trying to understand a homework problem given by an instructor and I'm honestly lost - I understand the concept of write-through/write-back, etc. but I can't figure out the actual calculations needed for the effective CPI, could anyone give me a hand? (The problem follows:
The following table provides the statistics of a cache for a
particular program. It is known that the base CPI (without cache
misses) is 1. It is also known that the memory bus bandwidth (the
bandwidth to transfer data between cache and memory) is 4 bytes per
cycle, and it takes one cycle to send the address before data
transfer. The memory spends 10 cycles to store data from bus or fetch
data to bus. The clock rate used by memory and the bus is a quarter of
the CPU clock rate.
Data reads per 1000 instructions: 100
Data writes per 1000 instructions: 150
Instruction cache miss rate: 0.4%
Data cache miss rate: 3%
Block size in bytes: 32
The effective CPI is the base CPU plus the CPI contribution from cache misses.
The cache miss CPI is the sum of the of instruction cache CPI and data cache CPI.
The cache miss cost is the cost of reading or writing to memory, so we will need that.
The cost in bus cycles is 1 (for the address) plus 10 (memory busy time) + 8 (32 byte blocks size divided by 4 bytes/cycle) = 19 cycles. Multiply this by 4 to get CPU cycles. Total is 76 CPU cycles.
So the cost for I cache misses is .004 * 76 = .304 cycles.
The cost for D caches misses is (.10 + .15) * .03 * 76 = .57 cycles
So the effective CPI is 1 + .304 + .57 = 1.874 cycles.

Cache bandwidth per tick for modern CPUs

What is a speed of cache accessing for modern CPUs? How many bytes can be read or written from memory every processor clock tick by Intel P4, Core2, Corei7, AMD?
Please, answer with both theoretical (width of ld/sd unit with its throughput in uOPs/tick) and practical numbers (even memcpy speed tests, or STREAM benchmark), if any.
PS it is question, related to maximal rate of load/store instructions in assembler. There can be theoretical rate of loading (all Instructions Per Tick are widest loads), but processor can give only part of such, a practical limit of loading.
For nehalem: rolfed.com/nehalem/nehalemPaper.pdf
Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache.
128 bit = 16 bytes / clock read
AND
128 bit = 16 bytes / clock write
(can I combine read and write in single cycle?)
The L2 and L3 caches each have a 256-bit port for reading or writing,
but the L3 cache must share its port with three other cores on the chip.
Can L2 and L3 read and write ports be used in single clock?
Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.
Latency (clock ticks), some measured by CPU-Z's latencytool or by lmbench's lat_mem_rd - both uses long linked list walk to correctly measure modern out-of-order cores like Intel Core i7
L1 L2 L3, cycles; mem link
Core 2 3 15 -- 66 ns http://www.anandtech.com/show/2542/5
Core i7-xxx 4 11 39 40c+67ns http://www.anandtech.com/show/2542/5
Itanium 1 5-6 12-17 130-1000 (cycles)
Itanium2 2 6-10 20 35c+160ns http://www.7-cpu.com/cpu/Itanium2.html
AMD K8 12 40-70c +64ns http://www.anandtech.com/show/2139/3
Intel P4 2 19 43 200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3 20 180 (cycles) --//--
AthlonFX-51 3 13 125 (cycles) --//--
POWER4 4 12-20 ?? hundreds cycles --//--
Haswell 4 11-12 36 36c+57ns http://www.realworldtech.com/haswell-cpu/5/
And good source on latency data is 7cpu web-site, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html
More about lat_mem_rd program is in its man page or here on SO.
Widest read/writes are 128 bit (16 byte) SSE load/store. L1/L2/L3 caches have different bandwidths and latencies and these are of course CPU-specific. Typical L1 latency is 2 - 4 clocks on modern CPUs but you can usually issue 1 or 2 load instructions per clock.
I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve ? Do you just want to write the fastest possible memcpy ?

Resources