EBS baseline performance too high? - performance

I am trying to benchmark an RDS instance (postgres) on AWS.
I created the instance with a 30 GB "general purpose" SSD volume ("gp2"). according to the AWS docs, this should provide a baseline performance of 100 IOPS:
Between a minimum of 100 IOPS (at 33.33 GiB and below) and a maximum
of 10,000 IOPS (at 3,334 GiB and above), baseline performance scales
linearly at 3 IOPS per GiB of volume size.
but in addition to that, there is burst performance:
When using General Purpose (SSD) storage, your DB instance receives an
initial I/O credit balance of 5.4 million I/O credits, which is enough
to sustain a burst performance of 3,000 IOPS for 30 minutes.
As I'm interested in sustained database performance (= the baseline case), I have to get rid of all I/O credits before starting my tests. I did this by running pgbench.
In the following screenshot, you can see that I start pgbench at 11:00, and around 3 hours later the burst balance is finally used up, and write IOPS drops off:
So far, so good. the timing makes sense -- 3 * 60 * 60 * 600 = 6.48 million (I/O credits are also refilled during the burst).
What I don't understand: why doesn't IOPS drop down to the baseline rate (100), but stay at 380 instead? Is the documented formula for baseline performance not valid any more?
UPDATE: i've shut down this test instance now, but here are the details:

sorry for the delay in my response
Why the extra performance?
With the db.m3.xlarge (which falls under Standard - Previous Edition header) - you have an extra 500 Mbps of additional, dedicated capacity for Amazon Elastic Block Store. This is per the chart and details at this link.
In the first section of Amazon EBS Performance Tips, it says to use EBS optimized instances for increased performance. So, I'd say this was the main reason you were getting the extra IOPS over the 100, after you exhausted your burst credits.
Cost Considerations:
According to the end of the paragraph, having your M3, you will incur extra cost for the extra performance. However, if you were to select the M4, the extra performance incurs no extra cost.
So in sustained database performance cost analysis, I would consider just the base price of the M4 vs. base price of M3 + incurred performance cost the M3 will bring you.
Good luck.

Related

How to get better performace in ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now in 6.4-15) and we noticed a decay in performance whenever there is some heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). OSDs are hard drives (HDD) WD Gold or better (4~12 Tb). Nodes with 64/128 Gbytes RAM, dual Xeon CPU mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench" getting stable 110 Mb/sec data transfer to each of them with +- 10 Mb/sec spread during normal operations. Apply/Commit Latency is normally below 55 ms with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front network and back network are both 1 Gbps (separated in VLANs), we are trying to move to 10 Gbps but we found some trouble we are still trying to figure out how to solve (unstable OSDs disconnections).
The Pool is defined as "replicated" with 3 copies (2 needed to keep running). Now the total amount of disk space is 305 Tb (72% used), reweight is in use as some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Got some Ceph pointers that you could follow...
get some good NVMEs (one or two per server but if you have 8HDDs per server 1 should be enough) and put those as DB/WALL (make sure they have power protection)
the ceph tell osd.* bench is not that relevant for real world, I suggest to try some FIO tests see here
set OSD osd_memory_target to at 8G or RAM minimum.
in order to save some write on your HDD (data is not replicated X times) create your RBD pool as EC (erasure coded pool) but please do some research on that because there are some tradeoffs. Recovery takes some extra CPU calculations
All and all, hype-converged clusters are good for training, small projects and medium projects with not such a big workload on them... Keep in mind that planning is gold
Just my 2 cents,
B.

What is reference when it says L1 Cache Reference or Main Memory Reference

So I am trying to learn performance metrics of various components of computer like L1 cache, L2 cache, main memory, ethernet, disk etc as below:
Latency Comparison Numbers
--------------------------
L1 cache **reference** 0.5 ns
Branch mispredict 5 ns
L2 cache **reference** 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory **reference** 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 10,000 ns 10 us
Send 1 KB bytes over 1 Gbps network 10,000 ns 10 us
Read 4 KB randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps 10,000,000 ns 10,000 us 10 ms 40x memory, 10X SSD
Read 1 MB sequentially from disk 30,000,000 ns 30,000 us 30 ms 120x memory, 30X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
I don't think the reference mentioned above is for how much data is read in bits or bytes. But is actually about maybe accessing one address in cache or memory.
Can someone please explain better what is this reference that's happening in 0.5 n/s ?
This table is listing typical numbers for some representative system,
as the actual values for a real example system would hardly be so "smooth numbers" but complicated sums over some non-even multiples of CPU and/or bus clock periods.
We could find such a table in a textbook for educative use.
This one apparently found its way into a general introduction into system designing1
from some conference presentations Google AI's lead person,
Jeff Dean
held in back in 20093,4.
The two presentation PDFs3,4
do not give an explicit definition what exactly was meant by "reference" in those tables. Instead, the tables are presented to point out that the ability for "back-of-the-envelope calculations" is crucial for successful system design.
The term "reference" likely means retrieving a piece of information from the corresponding level of memory if the requested value is maintained there, so that it doesn't have to be reloaded from a slower source:
L1 cache <- L2 cache <- Main memory (RAM) <- Disk (e.g., swap)
The upper-level sources (RAM, disk) can just be seen as a very rough sketch because here you will find lots of sub-levels and variants (type of mass device, internal cache on the disk's chipset, buses/bridges etc. etc.).
The present numbers appear to be a conclusion of experiences at Google's data center.
Therefore, let's assume they are based on some high-performance class hardware which was relevant in 2009 (or earlier).
Today (2020), the numbers should not be taken literally but to demonstrate the orders of magnitude in the context of the corresponding values for other levels of data transfer.
The label "branch mispredict" stands for all cases when a fetch operation from the next level is necessary, because a mispredicted branching decision is the most important reason for cases when such a fetch operation is critical w. r. t. latencies.
In other cases, branch prediction infrastructure is supposed to trigger data fetch operations in time so all latencies beyond the low "reference" value are hidden behind pipeline operations.
1
The URL you gave us in comment discussion
"Latency numbers every programmer should know" in: "The System Design Primer"
references the following sources:
2
Jeff Dean: "Latency Numbers Every Programmer Should Know", 31 May 2012.
"Originally by Peter Norvig ("Teach Yourself Programming in Ten Years") with some updates from Brendan", 1 Jun 2012.
3
Jeff Dean:
"Designs, Lessons and Advice from Building Large Distributed Systems",
13 Oct 2009,
page 24.
4
Jeff Dean:
"Software Engineering Advice from Building Large-Scale Distributed Systems",
17 Mar 2009,
page 13.
Going to the specific question about what is an L1 cache - it helps to understand multi-level caching -- https://en.wikipedia.org/wiki/CPU_cache#MULTILEVEL
While creating any cache there is a trade-off between hit-rate and latency. Larger caches generally have higher hit rate but also longer latency. To achieve the best of both worlds, many architectures implement 2 or more level of cache - an L1 which is small, super-fast backed by L2 which will be looked up in case of miss from L1, L2 being larger but also slower and so on. The metrics posted in your reference are a rough ballpark of an L1 hit, it would appear.

Approximating Processing Power from CPU-Time

In a particular scenario I found that a code has taken 20 CPU Years and 4 real Months time. My goal is to approximate the amount of processing power utilized considering the fact that all the processors were on 100% usage all the time. So, my approach is as follows,
20 CPU Years = 20 * 365 * 24 CPU Hours = 175,200 CPU Hours.
Now, 1 CPU Year means 1 GFLOP machine working for 1 real Hour. Which means, in this case, the work done is, 1 GFLOP machine working for 175,200 real Hours. But in reality it took 4 * 30 * 24 = 2,880 real hours. So, approximately 175,200/2,880 =(approx.) 61 GLFOP machine.
My question is am I doing the approximation correctly or misunderstanding some particular term as per the calculations given above ? Or I am mixing GFLOPS and GFLOP together ?
Definitions
My question is am I doing the approximation correctly or misunderstanding some particular term as per the calculations given above ?
"100% usage" may mean the CPU spent 20% of its time doing nothing waiting for data to be transferred to/from RAM (and/or branch mispredictions or other stalls), 10% of its time running faster than normal because other CPUs where actually doing nothing, and 15% of its time running slower than normal for power/temperature management reasons; and (depending on where you got that "100% usage" statistic) "100% usage" may be significantly more confusing (e.g. http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html ).
Depending on context; GFLOPS is either "theoretical maximum under perfect conditions that will never occur in practice" (worthless marketing hype); or a direct measurement of a specific case that ignores most of the work a CPU did (everything involving integers, all control flow, all data transfer, all memory management, ...)
In a particular scenario I found that a code has taken 20 CPU Years and 4 real Months time. My goal is to approximate the amount of processing power utilized.
From this; you might (or might not) be able to say "most of the work that CPUs did was discarded due to lockless algorithm retries and/or transactions that couldn't be committed; and (partly because the bottleneck was RAM bandwidth and partly because of the way SMT works on this system) it would have been 4 times as fast if half as many CPUs were used."
TL;DR: Approximating processor power is just an inconvenient way to obfuscate the (more useful) information that you started with (e.g. that a specific piece of code running on a specific piece of hardware that was working on a specific piece of data happened to take 4 months of real time).
Your Calculation:
Yes; you're mixing GFLOP and GFLOPS (e.g. GFLOPS = GFLOP per second; and a "1 GFLOP machine" is a computer that can do a billion floating point operations in an infinite amount of time, which is every computer), and the web page you linked to is making the same mistake (e.g. saying "a 1 GFLOP reference machine" when it should be saying "a 1 GFLOPS reference machine").
Note that there's no need to care about GFLOPS or GFLOP for the calculation you're doing: If something was supposed to take 20 "reference CPU years" and actually took 4 months (or 4/12 years); then you'd say that your hardware is equivalent to "20 / (4/12) = 60 reference CPUs". Of course this is horribly silly and it'd make more sense to say that your hardware happened to achieve 60 GFLOPS without bothering with the misleading "reference CPU" nonsense.

Why can't my ultraportable laptop CPU maintain peak performance in HPC

I have developed a high performance Cholesky factorization routine, which should have peak performance at around 10.5 GFLOPs on a single CPU (without hyperthreading). But there is some phenomenon which I don't understand when I test its performance. In my experiment, I measured the performance with increasing matrix dimension N, from 250 up to 10000.
In my algorithm I have applied caching (with tuned blocking factor), and data are always accessed with unit stride during computation, so cache performance is optimal; TLB and paging problem are eliminated;
I have 8GB available RAM, and the maximum memory footprint during experiment is under 800MB, so no swapping comes across;
During experiment, no resource demanding process like web browser is running at the same time. Only some really cheap background process is running to record CPU frequency as well as CPU temperature data every 2s.
I would expect the performance (in GFLOPs) should maintain at around 10.5 for whatever N I am testing. But a significant performance drop is observed in the middle of the experiment as shown in the first figure.
CPU frequency and CPU temperature are seen in the 2nd and 3rd figure. The experiment finishes in 400s. Temperature was at 51 degree when experiment started, and quickly rose up to 72 degree when CPU got busy. After that it grew slowly to the highest at 78 degree. CPU frequency is basically stable, and it did not drop when temperature got high.
So, my question is:
since CPU frequency did not drop, why performance suffers?
how exactly does temperature affect CPU performance? Does the increment from 72 degree to 78 degree really make things worse?
CPU info
System: Ubuntu 14.04 LTS
Laptop model: Lenovo-YOGA-3-Pro-1370
Processor: Intel Core M-5Y71 CPU # 1.20 GHz * 2
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0,1
Off-line CPU(s) list: 2,3
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Stepping: 4
CPU MHz: 1474.484
BogoMIPS: 2799.91
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0,1
CPU 0, 1
driver: intel_pstate
CPUs which run at the same hardware frequency: 0, 1
CPUs which need to have their frequency coordinated by software: 0, 1
maximum transition latency: 0.97 ms.
hardware limits: 500 MHz - 2.90 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 500 MHz and 2.90 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.40 GHz.
boost state support:
Supported: yes
Active: yes
update 1 (control experiment)
In my original experiment, CPU is kept busy working from N = 250 to N = 10000. Many people (primarily those whose saw this post before re-editing) suspected that the overheating of CPU is the major reason for performance hit. Then I went back and installed lm-sensors linux package to track such information, and indeed, CPU temperature rose up.
But to complete the picture, I did another control experiment. This time, I give CPU a cooling time between each N. This is achieved by asking the program to pause for a number of seconds at the start of iteration of the loop through N.
for N between 250 and 2500, the cooling time is 5s;
for N between 2750 and 5000, the cooling time is 20s;
for N between 5250 and 7500, the cooling time is 40s;
finally for N between 7750 and 10000, the cooling time is 60s.
Note that the cooling time is much larger than the time spent for computation. For N = 10000, only 30s are needed for Cholesky factorization at peak performance, but I ask for a 60s cooling time.
This is certainly a very uninteresting setting in high performance computing: we want our machine to work all the time at peak performance, until a very large task is completed. So this kind of halt makes no sense. But it helps to better know the effect of temperature on performance.
This time, we see that peak performance is achieved for all N, just as theory supports! The periodic feature of CPU frequency and temperature is the result of cooling and boost. Temperature still has an increasing trend, simply because as N increases, the work load is getting bigger. This also justifies more cooling time for a sufficient cooling down, as I have done.
The achievement of peak performance seems to rule out all effects other than temperature. But this is really annoying. Basically it says that computer will get tired in HPC, so we can't get expected performance gain. Then what is the point of developing HPC algorithm?
OK, here are the new set of plots:
I don't know why I could not upload the 6th figure. SO simply does not allow me to submit the edit when adding the 6th figure. So I am sorry I can't attach the figure for CPU frequency.
update 2 (how I measure CPU frequency and temperature)
Thanks to Zboson for adding the x86 tag. The following bash commands are what I used for measurement:
while true
do
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> cpu0_freq.txt ## parameter "freq0"
cat sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq >> cpu1_freq.txt ## parameter "freq1"
sensors | grep "Core 0" >> cpu0_temp.txt ## parameter "temp0"
sensors | grep "Core 1" >> cpu1_temp.txt ## parameter "temp1"
sleep 2
done
Since I did not pin the computation to 1 core, the operating system will alternately use two different cores. It makes more sense to take
freq[i] <- max (freq0[i], freq1[i])
temp[i] <- max (temp0[i], temp1[i])
as the overall measurement.
TL:DR: Your conclusion is correct. Your CPU's sustained performance is nowhere near its peak. This is normal: the peak perf is only available as a short term "bonus" for bursty interactive workloads, above its rated sustained performance, given the light-weight heat-sink, fans, and power-delivery.
You can develop / test on this machine, but benchmarking will be hard. You'll want to run on a cluster, server, or desktop, or at least a gaming / workstation laptop.
From the CPU info you posted, you have a dual-core-with-hyperthreading Intel Core M with a rated sustainable frequency of 1.20 GHz, Broadwell generation. Its max turbo is 2.9GHz, and it's TDP-up sustainable frequency is 1.4GHz (at 6W).
For short bursts, it can run much faster and make much more heat than it requires its cooling system to handle. This is what Intel's "turbo" feature is all about. It lets low-power ultraportable laptops like yours have snappy UI performance in stuff like web browsers, because the CPU load from interactive is almost always bursty.
Desktop/server CPUs (Xeon and i5/i7, but not i3) do still have turbo, but the sustained frequency is much closer to the max turbo. e.g. a Haswell i7-4790k has a sustained "rated" frequency of 4.0GHz. At that frequency and below, it won't use (and convert to heat) more than its rated TDP of 88W. Thus, it needs a cooling system that can handle 88W. When power/current/temperature allow, it can clock up to 4.4GHz and use more than 88W of power. (The sliding window for calculating the power history to keep the sustained power with 88W is sometimes configurable in the BIOS, e.g. 20sec or 5sec. Depending on what code is running, 4.4GHz might not increase the electrical current demand to anywhere near peak. e.g. code with lots of branch mispredicts that's still limited by CPU frequency, but that doesn't come anywhere near saturating the 256b AVX FP units like Prime95 would.)
Your laptop's max turbo is a factor of 2.4x higher than rated frequency. That high-end Haswell desktop CPU can only upclock by 1.1x. The max sustained frequency is already pretty close to the max peak limits, because it's rated to need a good cooling system that can keep up with that kind of heat production. And a solid power supply that can supply that much current.
The purpose of Core M is to have a CPU that can limit itself to ultra low power levels (rated TDP of 4.5 W at 1.2GHz, 6W at 1.4GHz). So the laptop manufacturer can safely design a cooling and power delivery system that's small and light, and only handles that much power. The "Scenario Design Power" is only 3.5W, and that's supposed to represent the thermal requirements for real-world code, not max-power stuff like Prime95.
Even a "normal" ULV laptop CPU is rated for 15W sustained, and high power gaming/workstation laptop CPUs at 45W. And of course laptop vendors put those CPUs into machines with beefier heat-sinks and fans. See a table on wikipedia, and compare desktop / server CPUs (also on the same page).
The achievement of peak performance seems to rule out all effects
other than temperature. But this is really annoying. Basically it says
that computer will get tired in HPC, so we can't get expected
performance gain. Then what is the point of developing HPC algorithm?
The point is to run them on hardware that's not so badly thermally limited! An ultra-low-power CPU like a Core M makes a decent dev platform, but not a good HPC compute platform.
Even a laptop with an xxxxM CPU, rather than a xxxxU CPU, will do ok. (e.g. a "gaming" or "workstation" laptop that's designed to run CPU-intensive stuff for sustained periods). Or in Skylake-family, "xxxxH" or "HK" are the 45W mobile CPUs, at least quad-core.
Further reading:
Modern Microprocessors
A 90-Minute Guide!
[Power Delivery in a Modern Processor] - general background, including the "power wall" that Pentium 4 ran into.
(https://www.realworldtech.com/power-delivery/) - really deep technical dive into CPU / motherboard design and the challenges of delivering stable low-voltage to very bursty demands, and reacting quickly to the CPU requesting more / less voltage as it changes frequency.

EC2 instance types's exact network performance?

I cannot find exact network performance details for different EC2 instance types on Amazon. Instead, they are only saying:
High
Moderate
Low
What does this even mean? I especially want to know the exact amount of Traffic-OUT on each instance type.
I need to do live streaming and my stream bit rate will be 240kbps. So I need to know which instance type can handle how many concurrent viewers.
Bandwidth is tiered by instance size, here's a comprehensive answer:
For t2/m3/c3/c4/r3/i2/d2 instances:
t2.nano = ??? (Based on the scaling factors, I'd expect 20-30 MBit/s)
t2.micro = ~70 MBit/s (qiita says 63 MBit/s) - t1.micro gets about ~100 Mbit/s
t2.small = ~125 MBit/s (t2, qiita says 127 MBit/s, cloudharmony says 125 Mbit/s with spikes to 200+ Mbit/s)
*.medium = t2.medium gets 250-300 MBit/s, m3.medium ~400 MBit/s
*.large = ~450-600 MBit/s (the most variation, see below)
*.xlarge = 700-900 MBit/s
*.2xlarge = ~1 GBit/s +- 10%
*.4xlarge = ~2 GBit/s +- 10%
*.8xlarge and marked specialty = 10 Gbit, expect ~8.5 GBit/s, requires enhanced networking & VPC for full throughput
m1 small, medium, and large instances tend to perform higher than expected. c1.medium is another freak, at 800 MBit/s.
I gathered this by combing dozens of sources doing benchmarks (primarily using iPerf & TCP connections). Credit to CloudHarmony & flux7 in particular for many of the benchmarks (note that those two links go to google searches showing the numerous individual benchmarks).
Caveats & Notes:
The large instance size has the most variation reported:
m1.large is ~800 Mbit/s (!!!)
t2.large = ~500 MBit/s
c3.large = ~500-570 Mbit/s (different results from different sources)
c4.large = ~520 MBit/s (I've confirmed this independently, by the way)
m3.large is better at ~700 MBit/s
m4.large is ~445 Mbit/s
r3.large is ~390 Mbit/s
Burstable (T2) instances appear to exhibit burstable networking performance too:
The CloudHarmony iperf benchmarks show initial transfers start at 1 GBit/s and then gradually drop to the sustained levels above after a few minutes. PDF links to reports below:
t2.small (PDF)
t2.medium (PDF)
t2.large (PDF)
Note that these are within the same region - if you're transferring across regions, real performance may be much slower. Even for the larger instances, I'm seeing numbers of a few hundred MBit/s.
FWIW CloudFront supports streaming as well. Might be better than plain streaming from instances.
Almost everything in EC2 is multi-tenant. What the network performance indicates is what priority you will have compared with other instances sharing the same infrastructure.
If you need a guaranteed level of bandwidth, then EC2 will likely not work well for you.

Resources