How is cpu config for haproxy handled within docker? - performance

I'm wondering about haproxy performance from within a container. To make things simple if I have a vm running haproxy with this cpu config I know what to expect:
nbproc 1
nbthread 8
cpu-map auto:1/1-8 0-7
If I want to port the (whole) config to docker for testing purposes without any fancy swarm magic or setup just docker so that I can understand how things map, I'd imagine that the cpu config gets simpler and that the haproxy instance is meant to scale. I guess I have two questions:
Would you even bother configuring cpu from within an haproxy docker container or would you scale the container from behind a service? Maybe you need both.
Can a single container utilise the above config as though it were running on the system as a daemon? Would docker / containerd even care about this config?
I know having 4 containers each with their own config with the cpu evenly mapped like so wouldn't scale or make any sense:
nbproc 1
nbthread 2
cpu-map auto:1/1-2 0-1
nbproc 1
nbthread 2
cpu-map auto:1/3-4 2-3
nbproc 1
nbthread 2
cpu-map auto:1/5-6 4-5
nbproc 1
nbthread 2
cpu-map auto:1/7-8 6-7
But it's this sort of saturation that I'm wondering about. Just how does haproxy / docker handle this sort of cpu nuance?

I've confirmed that there's little to no perceivable impact to service when running haproxy under containerd vs running under systemd using the image provided by haproxy. Running a single container -d with --network host and no limits on cpu or memory at worst I've seen a 2-3% impact on web external latency with live traffic peaked at about 50-60MB/sec, which itself is dependent on throughput and type of requests. On an 8 core vm with 4GB mem (host cpu is xeon 6130 Gold) and a gig interface the memory utilisation is almost identical. cpu performance also remains stable with potential 3-5% increase in utilisation. These tests are private and unpublished.
As far as cpu configuration goes
nbproc 1
nbthread 8
cpu-map auto:1/1-8 0-7
master-worker
This config maps 1:1 between containerd and systemd and yeilds the results already mentioned. The proc and threads will start up under containerd and function as you expect. This takes up about 80-90% of the total cpu (800%) which represents less than 1 fully loaded core at peak. So this container could be scaled with this configuration a further 8 times in theory, 5 or 6 times to leave some headroom.
Also note that any fluctuations in these performance data are likely due to my environment. These tests were taken from a real environment acorss multiple sites not a test bed where I controlled every aspect. Also note depending on your host cpu and load your results will vary wildly.

Related

How to get better performace in ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now in 6.4-15) and we noticed a decay in performance whenever there is some heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). OSDs are hard drives (HDD) WD Gold or better (4~12 Tb). Nodes with 64/128 Gbytes RAM, dual Xeon CPU mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench" getting stable 110 Mb/sec data transfer to each of them with +- 10 Mb/sec spread during normal operations. Apply/Commit Latency is normally below 55 ms with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front network and back network are both 1 Gbps (separated in VLANs), we are trying to move to 10 Gbps but we found some trouble we are still trying to figure out how to solve (unstable OSDs disconnections).
The Pool is defined as "replicated" with 3 copies (2 needed to keep running). Now the total amount of disk space is 305 Tb (72% used), reweight is in use as some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Got some Ceph pointers that you could follow...
get some good NVMEs (one or two per server but if you have 8HDDs per server 1 should be enough) and put those as DB/WALL (make sure they have power protection)
the ceph tell osd.* bench is not that relevant for real world, I suggest to try some FIO tests see here
set OSD osd_memory_target to at 8G or RAM minimum.
in order to save some write on your HDD (data is not replicated X times) create your RBD pool as EC (erasure coded pool) but please do some research on that because there are some tradeoffs. Recovery takes some extra CPU calculations
All and all, hype-converged clusters are good for training, small projects and medium projects with not such a big workload on them... Keep in mind that planning is gold
Just my 2 cents,
B.

How is heroku's dyno implemented? (e.g. one dyno = one core CPU in AWS?)

I am studying PAAS archtecture, and found heroku's dyno looks very similar with IAAS's CPU:
1 dyno looks very like a 1 core CPU
can run docker/k8s
can show metrics like CPU usage , like 30% usage
can scale from 1 to 10, very like virtual-machine or virtual CPU
so I want to know,
how does heroku implement a dyno?(e.g. is a dyno simplely equals to a docker?)
how do they get the metrics of a dyno?
1 dyno ~= 1 core cpu docker
I spent several hours read the documents about the open sourced PAAS, like dokku, and I believe this kind of container is based on docker.
1 container in PASS ~= 1 docker
so,
1 poor dyno ~= 1 core cpu 512MB mem docker
1 strong dyno ~= 16 core cpu 32G mem docker
thanks to the docker's API, you can change its config dynamically.
Also you can monitor/metric your dyno , such as: CPU /memory/network/disk usage.
You can implement your own dyno in minutes. that's it.
refer to:
https://docs.docker.com/engine/api/
https://dokku.com/

Docker Container CPU usage Monitoring

As per the documentation of docker.
We can get CPU usage of docker container with docker stats command.
The column CPU % will give the percentage of the host’s CPU the container is using.
Let say I limit the container to use 50% of hosts single CPU. I can specify 50% single CPU core limit by --cpus=0.5 option as per https://docs.docker.com/config/containers/resource_constraints/
How can we get the CPU% usage of container out of allowed CPU core by any docker command?
E.g. Out of 50% Single CPU core, 99% is used.
Is there any way to get it with cadvisor or prometheus?
How can we get the CPU% usage of container out of allowed CPU core by any docker command? E.g. Out of 50% Single CPU core, 99% is used.
Docker has docker stats command which shows CPU/Memory usage and few other stats:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c43f085dea8c foo_test.1.l5haec5oyr36qdjkv82w9q32r 0.00% 11.15MiB / 100MiB 11.15% 7.45kB / 0B 3.29MB / 8.19kB 9
Though it does show memory usage regarding the limit out of the box, there is no such feature for CPU yet. It is possible to solve that with a script that will calculate the value on the fly, but I'd rather chosen the second option.
Is there any way to get it with cadvisor or prometheus?
Yes, there is:
irate(container_cpu_usage_seconds_total{cpu="total"}[1m])
/ ignoring(cpu)
(container_spec_cpu_quota/container_spec_cpu_period)
The first line is a typical irate function that calculates how much of CPU seconds a container has used. It comes with a label cpu="total", which the second part does not have, and that's why there is ignoring(cpu).
The bottom line calculates how many CPU cores a container is allowed to use. There are two metrics:
container_spec_cpu_quota - the actual quota value. The value is computed of a fraction of CPU cores that you've set as the limit and multiplied by container_spec_cpu_period.
container_spec_cpu_period - comes from CFS Scheduler and it is like a unit of the quota value.
I know it may be hard to grasp at first, allow me to explain on an example:
Consider that you have container_spec_cpu_period set to the default value, which is 100,000 microseconds, and container CPU limit is set to half a core (0.5). In this case:
container_spec_cpu_period 100,000
container_spec_cpu_quota 50,000 # =container_spec_cpu_period*0.5
With CPU limit set to two cores you will have this:
container_spec_cpu_quota 200,000
And so by dividing one by another we get the fraction of CPU cores back, which is then used in another division to calculate how much of the limit is used.

Improve erlang cowboy performance

We have been using Cowboy in production on our Compute Engine machines on GCP and we started benchmarking and improving the performance of our service to handle more Reqs/sec (in our case since we are in Adtech it is bids/sec).
After isolating and handling a lot of the issues separately we came down to Cowboy optimization, these are our current findings and limitations:
Cowboy setup
We are using Cowboy 2.5 with 200 acceptors and max backlog of 1024
init(Req, _State) ->
T1 = erlang:monotonic_time(),
{ok, BRjson, _} = cowboy_req:read_body(Req),
%% ---- rest of work goes here but is switched off for our test---
erlang:send_after(60, self(), {'RSP', x, no_workers}),
{cowboy_loop, Req, #state{t1 = T1}, hibernate}.
Erlang VM
OTP 21
VM args: -smp auto +P 134217727 +K true +A 64 -rate 1200 +stbt db +scl false +sfwi 500 +spp true +zdbbl 8092
Load
Json requests ~4KB in size. And testing is done using a separate machine on the same internal network (no SSL) using jmeter. All requests are POST with keep-alive
Servers
GCP Compute Engine 10 vcpu cores and 14GB RAM (now and tested before with the 4 vcpu)
Findings
We are able to reach to ~1900 reqs/sec but all CPU cores in htop are showing almost 80% utilization
At 1000 reqs/sec we se cpu utilization at 45-50% per core (still high bearing in mind that no other part of our application is running)
*Note: using the 4 vcpu machine we were able to get close to 700 reqs/sec and memory in all of our tests is barely utilizied or changing with load
QUESTION: How to improve cowboy's performance in terms of cpu usage?
First off, thanks #Pouriya for suggestions--actually, discussing this back and forth made me go back and re-check one of my comments about the right tool for the job. PS: we are on GCP so 72 cores would be out of question at this stage.
Cowboy is great! but it does add a bit of overhead in the critical path of each request--a feature (or issue in my case) that is not needed.
We tested again with Elli (https://github.com/elli-lib/elli) but built a proper testing setup this time and it provided improvement up to 20% ~ exactly what we needed!
If anyone at Cowboy/Ranch team has a way of drastically improving CPU overhead will gladly test since we still use it in our APIs but not the critical path.

How to measure performance in Docker?

Is it possible to have performance issues in Docker?
Because I know vm's and you have to specify how much RAM you want to use etc.
But I don't know that in docker. It's just running. Will it automatically use the RAM what it needs or how is this working?
Will it automatically use the RAM what it needs or how is this working?
No by default, it will use the minimum memory needed, up to a limit.
You can use docker stats to see it against a running container:
$ docker stats redis1 redis2
CONTAINER CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O
redis1 0.07% 796 KB / 64 MB 1.21% 788 B / 648 B 3.568 MB / 512 KB
redis2 0.07% 2.746 MB / 64 MB 4.29% 1.266 KB / 648 B 12.4 MB / 0 B
When you use docker run, you can specify those limits with Runtime constraints on resources.
That includes RAM:
-m, --memory=""
Memory limit (format: <number>[<unit>], where unit = b, k, m or g)
Under normal circumstances, containers can use as much of the memory as needed and are constrained only by the hard limits set with the -m/--memory option.
When memory reservation is set, Docker detects memory contention or low memory and forces containers to restrict their consumption to a reservation limit.
By default, kernel kills processes in a container if an out-of-memory (OOM) error occurs.
To change this behaviour, use the --oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option.
Note: the upcoming (1.10) docker update command might include dynamic memory changes. See docker update.
By default, docker containers are not limited in the amount of resources they can consume from the host. Containers are limited in what permissions / capabilities they have (that's the "container" part).
You should always set constraints on a container, for example, the maximum amount of memory a container is allowed to use, the amount of swap space, and the amount of CPU. Not setting such limits can potentially lead to the host running out of memory, and the kernel killing off random processes (OOM kill), to free up memory. "Random" in that case, can also mean that the kernel kills your ssh server, or the docker daemon itself.
Read more on constraining resources on a container in Runtime constraints on resources in the manual.

Resources