EC2 instance types' exact network performance? - amazon-ec2

I cannot find exact network performance details for the different EC2 instance types from Amazon. Instead, they only describe it as:
High
Moderate
Low
What does this even mean? I especially want to know the exact amount of outbound traffic (Traffic-OUT) each instance type supports.
I need to do live streaming, and my stream bit rate will be 240 kbps, so I need to know how many concurrent viewers each instance type can handle.

Bandwidth is tiered by instance size; here's a comprehensive answer:
For t2/m3/c3/c4/r3/i2/d2 instances:
t2.nano = ??? (based on the scaling factors, I'd expect 20-30 Mbit/s)
t2.micro = ~70 Mbit/s (qiita says 63 Mbit/s) - t1.micro gets about 100 Mbit/s
t2.small = ~125 Mbit/s (t2, qiita says 127 Mbit/s, cloudharmony says 125 Mbit/s with spikes to 200+ Mbit/s)
*.medium = t2.medium gets 250-300 Mbit/s, m3.medium ~400 Mbit/s
*.large = ~450-600 Mbit/s (the most variation, see below)
*.xlarge = 700-900 Mbit/s
*.2xlarge = ~1 Gbit/s ±10%
*.4xlarge = ~2 Gbit/s ±10%
*.8xlarge and instances marked 10 Gigabit = 10 Gbit/s nominal, expect ~8.5 Gbit/s; requires enhanced networking & VPC for full throughput
m1 small, medium, and large instances tend to perform higher than expected. c1.medium is another outlier, at 800 Mbit/s.
I gathered this by combing dozens of sources doing benchmarks (primarily using iPerf & TCP connections). Credit to CloudHarmony & flux7 in particular for many of the benchmarks (note that those two links go to google searches showing the numerous individual benchmarks).
Caveats & Notes:
The large instance size has the most variation reported:
m1.large is ~800 Mbit/s (!!!)
t2.large = ~500 Mbit/s
c3.large = ~500-570 Mbit/s (different results from different sources)
c4.large = ~520 Mbit/s (I've confirmed this independently, by the way)
m3.large is better at ~700 Mbit/s
m4.large is ~445 Mbit/s
r3.large is ~390 Mbit/s
Burstable (T2) instances appear to exhibit burstable networking performance too:
The CloudHarmony iperf benchmarks show initial transfers starting at 1 Gbit/s and then gradually dropping to the sustained levels above after a few minutes. PDF links to the reports are below:
t2.small (PDF)
t2.medium (PDF)
t2.large (PDF)
Note that these figures are within the same region; if you're transferring across regions, real performance may be much slower. Even for the larger instances, I'm seeing numbers of only a few hundred Mbit/s.
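To translate these figures into the concurrent-viewer question above: divide the usable sustained bandwidth by the 240 kbps stream bit rate, leaving some headroom for protocol overhead and variability. A rough sketch (the per-type rates are the approximate benchmark numbers from above; the 30% headroom factor is my own assumption):
# Rough concurrent-viewer estimate for a 240 kbps stream.
# Sustained rates are the approximate benchmark figures quoted above;
# the 30% headroom factor is an assumption, not an AWS guarantee.
STREAM_KBPS = 240
HEADROOM = 0.7  # keep ~30% spare for overhead and throughput variability

sustained_mbps = {
    "t2.micro": 70,
    "t2.small": 125,
    "t2.medium": 275,
    "c4.large": 520,
    "m3.xlarge": 800,
}

for itype, mbps in sustained_mbps.items():
    viewers = mbps * 1000 * HEADROOM / STREAM_KBPS
    print(f"{itype}: ~{viewers:.0f} concurrent viewers")
# e.g. t2.small: 125,000 kbps * 0.7 / 240 kbps ≈ 364 viewers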

FWIW, CloudFront supports streaming as well. It might be better than streaming directly from instances.

Almost everything in EC2 is multi-tenant. What the network performance indicates is what priority you will have compared with other instances sharing the same infrastructure.
If you need a guaranteed level of bandwidth, then EC2 will likely not work well for you.

Related

How to get better performance in a ProxmoxVE + Ceph cluster

We have been running ProxmoxVE since 5.0 (now on 6.4-15) and we have noticed a drop in performance whenever there is heavy reading/writing.
We have 9 nodes, 7 with Ceph and 56 OSDs (8 on each node). The OSDs are hard drives (HDD), WD Gold or better (4-12 TB). Nodes have 64/128 GB RAM and dual-Xeon mainboards (various models).
We have already tried simple tests like "ceph tell osd.* bench", getting a stable 110 Mb/s data transfer to each of them, with a ±10 Mb/s spread during normal operations. Apply/Commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front network and back network are both 1 Gbps (separated into VLANs). We are trying to move to 10 Gbps, but we have hit some trouble we are still trying to figure out how to solve (unstable OSD disconnections).
The pool is defined as "replicated" with 3 copies (2 needed to keep running). The total amount of disk space is now 305 TB (72% used); reweight is in use, as some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Here are some Ceph pointers that you could follow:
Get some good NVMe drives (one or two per server; with 8 HDDs per server, one should be enough) and use them for DB/WAL (make sure they have power-loss protection).
"ceph tell osd.* bench" is not that relevant for real-world performance; I suggest running some fio tests instead (see here).
Set osd_memory_target to at least 8 GB of RAM per OSD.
To save some writes on your HDDs (data is not replicated X times), create your RBD pool as an EC (erasure-coded) pool, but please do some research on that first because there are tradeoffs: recovery takes extra CPU, for example. A rough capacity comparison is sketched below.
All in all, hyper-converged clusters are good for training, small projects, and medium projects without such a big workload on them... Keep in mind that planning is gold.
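As a rough illustration of the replication-vs-EC tradeoff mentioned above, here is a back-of-the-envelope capacity comparison for the 305 TB of raw space in the question (the 4+2 EC profile is just an example, not a recommendation for this cluster):
# Capacity / write-amplification comparison, replicated vs erasure-coded.
# The EC profile k=4, m=2 is an illustrative assumption.
raw_tb = 305

# Replicated, size=3: every object is stored three times.
replicated_usable = raw_tb / 3              # ~102 TB usable, 3.0x overhead

# Erasure-coded k=4, m=2: data split into 4 data chunks plus 2 parity chunks.
k, m = 4, 2
ec_usable = raw_tb * k / (k + m)            # ~203 TB usable, 1.5x overhead

print(f"replicated 3x: {replicated_usable:.0f} TB usable (3.0x overhead)")
print(f"EC {k}+{m}: {ec_usable:.0f} TB usable ({(k + m) / k:.1f}x overhead)")
# Tradeoff: EC saves space and raw writes, but recovery and degraded reads cost
# extra CPU, and small random writes on HDD-backed EC pools can be slower.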
Just my 2 cents,
B.

Improve Erlang Cowboy performance

We have been using Cowboy in production on our Compute Engine machines on GCP, and we have started benchmarking and improving the performance of our service to handle more reqs/sec (in our case, since we are in ad tech, it is bids/sec).
After isolating and handling a lot of the issues separately, we came down to Cowboy optimization; these are our current findings and limitations:
Cowboy setup
We are using Cowboy 2.5 with 200 acceptors and a max backlog of 1024:
init(Req, _State) ->
    T1 = erlang:monotonic_time(),
    %% read the full request body (the ~4 KB JSON bid request)
    {ok, BRjson, _} = cowboy_req:read_body(Req),
    %% ---- rest of work goes here but is switched off for our test ---
    erlang:send_after(60, self(), {'RSP', x, no_workers}),
    %% switch to a loop handler and hibernate while waiting for the reply
    {cowboy_loop, Req, #state{t1 = T1}, hibernate}.
Erlang VM
OTP 21
VM args: -smp auto +P 134217727 +K true +A 64 -rate 1200 +stbt db +scl false +sfwi 500 +spp true +zdbbl 8092
Load
JSON requests are ~4 KB in size. Testing is done from a separate machine on the same internal network (no SSL) using JMeter. All requests are POSTs with keep-alive.
Servers
GCP Compute Engine, 10 vCPU cores and 14 GB RAM (now; previously tested with 4 vCPUs)
Findings
We are able to reach ~1900 reqs/sec, but all CPU cores in htop show almost 80% utilization.
At 1000 reqs/sec we see CPU utilization of 45-50% per core (still high, bearing in mind that no other part of our application is running).
*Note: using the 4 vCPU machine we were able to get close to 700 reqs/sec; memory in all of our tests is barely utilized and barely changes with load.
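To put those utilization figures in perspective, here is a rough per-request CPU budget derived from them (a sketch; it ignores scheduler and interrupt overhead, and the 80% figure for the 4 vCPU run is an assumption):
# Approximate CPU time spent per request at the measured load levels.
def cpu_ms_per_request(vcpus, utilization, reqs_per_sec):
    # total CPU-milliseconds consumed per second, divided by requests per second
    return vcpus * utilization * 1000 / reqs_per_sec

print(cpu_ms_per_request(10, 0.80, 1900))   # ~4.2 ms of CPU per request at peak
print(cpu_ms_per_request(10, 0.475, 1000))  # ~4.8 ms of CPU per request at 1000 rps
print(cpu_ms_per_request(4, 0.80, 700))     # ~4.6 ms on the 4 vCPU machine (assumed ~80% util)
# The cost per request stays roughly constant across machine sizes, which points
# to fixed per-request overhead (parsing, HTTP handling) rather than contention.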
QUESTION: How can we improve Cowboy's performance in terms of CPU usage?
First off, thanks @Pouriya for the suggestions -- actually, discussing this back and forth made me go back and re-check one of my comments about the right tool for the job. PS: we are on GCP, so 72 cores would be out of the question at this stage.
Cowboy is great! But it does add a bit of overhead in the critical path of each request -- a feature (or, in my case, an issue) that is not needed.
We tested again with Elli (https://github.com/elli-lib/elli), but built a proper testing setup this time, and it provided an improvement of up to 20%, exactly what we needed!
If anyone on the Cowboy/Ranch team has a way of drastically reducing the CPU overhead, I will gladly test it, since we still use Cowboy in our APIs, just not in the critical path.

EBS baseline performance too high?

I am trying to benchmark an RDS instance (Postgres) on AWS.
I created the instance with a 30 GB "general purpose" SSD volume ("gp2"). According to the AWS docs, this should provide a baseline performance of 100 IOPS:
Between a minimum of 100 IOPS (at 33.33 GiB and below) and a maximum of 10,000 IOPS (at 3,334 GiB and above), baseline performance scales linearly at 3 IOPS per GiB of volume size.
but in addition to that, there is burst performance:
When using General Purpose (SSD) storage, your DB instance receives an initial I/O credit balance of 5.4 million I/O credits, which is enough to sustain a burst performance of 3,000 IOPS for 30 minutes.
As I'm interested in sustained database performance (= the baseline case), I have to get rid of all I/O credits before starting my tests. I did this by running pgbench.
In the following screenshot, you can see that I start pgbench at 11:00, and around 3 hours later the burst balance is finally used up, and write IOPS drops off:
So far, so good. The timing makes sense: 3 * 60 * 60 * 600 = 6.48 million (I/O credits are also refilled during the burst).
What I don't understand: why does the IOPS rate not drop down to the baseline (100) but instead stay at around 380? Is the documented formula for baseline performance no longer valid?
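For reference, here is the gp2 arithmetic from the quoted docs applied to this 30 GiB volume (a sketch of the documented formula only; the ~600 IOPS workload figure comes from the calculation above):
# gp2 baseline and burst-drain math for the 30 GiB volume described above.
volume_gib = 30
baseline_iops = max(100, min(10_000, 3 * volume_gib))   # -> 100 IOPS for 30 GiB

credits = 5_400_000        # initial I/O credit balance per the docs
workload_iops = 600        # roughly what the benchmark sustained while bursting

# Credits drain at about (workload - baseline) per second,
# since they also refill at roughly the baseline rate.
seconds_to_empty = credits / (workload_iops - baseline_iops)
print(baseline_iops, seconds_to_empty / 3600)            # 100 IOPS, ~3 hours
# This matches the ~3 hours observed, but the sustained ~380 IOPS afterwards is
# not explained by the gp2 formula alone (see the answer below on EBS-optimized
# dedicated throughput).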
UPDATE: I've shut down this test instance now, but here are the details:
Sorry for the delay in my response.
Why the extra performance?
With the db.m3.xlarge (which falls under the Standard - Previous Generation header), you have an additional 500 Mbps of dedicated capacity for Amazon Elastic Block Store. This is per the chart and details at this link.
The first section of Amazon EBS Performance Tips says to use EBS-optimized instances for increased performance. So I'd say this was the main reason you were getting extra IOPS beyond the 100 after you exhausted your burst credits.
Cost Considerations:
According to the end of that paragraph, with your M3 you will incur an extra cost for the extra performance. However, if you were to select an M4, the extra performance incurs no extra cost.
So, for a sustained-database-performance cost analysis, I would compare just the base price of the M4 against the base price of the M3 plus the performance surcharge the M3 will bring you.
Good luck.

Application not running at full speed?

I have the following scenario:
machine 1: receives messages from outside and processes them (via a Java application). For processing, it relies on a database (on machine 2)
machine 2: an Oracle DB
As performance metrics I usually look at the value of processed messages per time.
Now, what puzzles me: neither of the two machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as if they do not have enough to do.
What I would expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message-processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So, to me, none of the values seem to be at any limit.
PS: for testing, the message queue is of course always full, so that both machines have enough work to do.
To find bottlenecks, you typically also need to measure INSIDE the application. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.
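One hypothesis consistent with these numbers is that throughput is limited by per-message latency (each message waiting on a synchronous database round trip) rather than by any saturated resource. A rough Little's Law sketch with assumed values (the latency and concurrency numbers below are illustrative, not measured):
# Little's Law: throughput = concurrency / average time per message.
def max_msgs_per_sec(worker_threads, avg_latency_ms):
    return worker_threads * 1000.0 / avg_latency_ms

# e.g. 4 worker threads, each spending ~20 ms per message (mostly waiting on the DB)
print(max_msgs_per_sec(4, 20))    # -> 200 messages/s, with CPU and disk mostly idle
print(max_msgs_per_sec(32, 20))   # -> 1600 messages/s at the same per-message latency
# If throughput scales with concurrency like this, the bottleneck is latency
# (network round trips, commit waits), which never shows up as 100% CPU or disk.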

How fast is data transfer between instances in Windows Azure?

Suppose I create a Windows Azure application that consists of multiple instances talking to each other by starting a server on each instance and exchanging big chunks of data.
What data transfer speed should I expect from the underlying infrastructure?
It depends a bit on what size your instances are:
XS instance: 5 Mbps max
S: 100 Mbps sustained, ~250 Mbps bursts
M: 200 Mbps sustained, ~400-500 Mbps bursts
L: 400 Mbps sustained, up to 800 Mbps bursts
XL: 800 Mbps - you get the whole NIC
Those are the limits. There are other factors as well of course:
Are you communicating within a datacenter (sub-region)? Assuming yes here.
Are you using affinity groups? That would put you in the same stamp and you could minimize switch traffic - not typically a huge deal, as the NIC is the slowest link, but it would help latency a tiny bit. If this is all within a role, you are definitely in the same affinity group and the same deployment.
Are you writing to disk to buffer communication? Disk I/O speeds differ between instance sizes as well. If you are buffering large files or similar to disk, you will see overall throughput drop as the disk tries to keep up. XL instances have the best I/O performance.
There are likely other factors as well, but these are what I can think of off the top of my head.
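As a quick way to turn those rates into expectations, here is a rough transfer-time estimate at the sustained speeds listed above (a sketch that ignores TCP, serialization, and disk-buffering overhead):
# Rough time to move a chunk of data at the sustained rates listed above.
def transfer_seconds(size_gb, sustained_mbps):
    bits = size_gb * 8 * 1000**3            # decimal GB -> bits
    return bits / (sustained_mbps * 1000**2)

for size_gb, mbps, label in [(1, 100, "S"), (1, 400, "L"), (10, 800, "XL")]:
    print(f"{size_gb} GB on {label}: ~{transfer_seconds(size_gb, mbps):.0f} s")
# 1 GB at 100 Mbps -> ~80 s; at 400 Mbps -> ~20 s; 10 GB at 800 Mbps -> ~100 s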
