Improve Erlang Cowboy performance

We have been using Cowboy in production on our Compute Engine machines on GCP, and we started benchmarking and improving the performance of our service so it can handle more requests/sec (since we are in ad tech, that means bids/sec).
After isolating and handling many of the issues separately, we came down to Cowboy optimization. These are our current findings and limitations:
Cowboy setup
We are using Cowboy 2.5 with 200 acceptors and a max backlog of 1024.
init(Req, _State) ->
    T1 = erlang:monotonic_time(),
    {ok, BRjson, _} = cowboy_req:read_body(Req),
    %% ---- rest of work goes here but is switched off for our test ----
    erlang:send_after(60, self(), {'RSP', x, no_workers}),
    {cowboy_loop, Req, #state{t1 = T1}, hibernate}.
Erlang VM
OTP 21
VM args: -smp auto +P 134217727 +K true +A 64 -rate 1200 +stbt db +scl false +sfwi 500 +spp true +zdbbl 8092
Load
JSON requests of ~4 KB each. Testing is done with JMeter from a separate machine on the same internal network (no SSL). All requests are POST with keep-alive.
Servers
GCP Compute Engine with 10 vCPUs and 14 GB RAM (current setup; previously tested with 4 vCPUs)
Findings
We are able to reach ~1900 reqs/sec, but all CPU cores in htop show almost 80% utilization.
At 1000 reqs/sec we see CPU utilization of 45-50% per core (still high, bearing in mind that no other part of our application is running).
*Note: on the 4 vCPU machine we were able to get close to 700 reqs/sec. Memory is barely utilized in all of our tests and hardly changes with load.
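To compare these findings on one scale, the figures can be turned into CPU time spent per request. This is only a rough sketch in Python: it assumes the htop percentages are averages across all vCPUs, and the function name is made up for illustration.

```python
def cpu_ms_per_request(reqs_per_sec, vcpus, utilization):
    """Approximate CPU milliseconds consumed per request.

    utilization is the average per-core utilization as a fraction (0..1).
    """
    busy_core_seconds_per_sec = vcpus * utilization
    return busy_core_seconds_per_sec / reqs_per_sec * 1000.0


# Figures taken from the findings above (10 vCPU machine).
at_1900 = cpu_ms_per_request(1900, 10, 0.80)
at_1000_low = cpu_ms_per_request(1000, 10, 0.45)
at_1000_high = cpu_ms_per_request(1000, 10, 0.50)

print(f"1900 reqs/sec: {at_1900:.2f} ms CPU per request")
print(f"1000 reqs/sec: {at_1000_low:.2f}-{at_1000_high:.2f} ms CPU per request")
```

Under those assumptions the cost stays in the 4-5 ms range at both load levels, which suggests a roughly constant per-request overhead rather than contention that grows with load.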
QUESTION: How can we improve Cowboy's performance in terms of CPU usage?

First off, thanks @Pouriya for the suggestions. Discussing this back and forth actually made me go back and re-check one of my comments about the right tool for the job. PS: we are on GCP, so 72 cores are out of the question at this stage.
Cowboy is great, but it adds some overhead in the critical path of each request, functionality that we do not need in our case.
We tested again with Elli (https://github.com/elli-lib/elli), this time with a proper testing setup, and it gave an improvement of up to 20%, exactly what we needed.
If anyone on the Cowboy/Ranch team has a way of drastically reducing the CPU overhead, we will gladly test it, since we still use Cowboy in our APIs, just not in the critical path.

Related

How to get better performance in ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now on 6.4-15) and we noticed a decay in performance whenever there is some heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). OSDs are hard drives (HDD), WD Gold or better (4~12 TB). Nodes have 64/128 GB RAM and dual Xeon CPU mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/sec data transfer on each OSD with a spread of +-10 MB/sec during normal operations. Apply/Commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and one-third below 20 ms.
The front network and back network are both 1 Gbps (separated in VLANs). We are trying to move to 10 Gbps, but we ran into trouble that we are still trying to solve (unstable OSD disconnections).
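For a sense of scale, the link speed can be compared against the per-node disk throughput from the bench figures. This is a back-of-the-envelope sketch in Python: it assumes the bench result is in megabytes per second, ignores protocol overhead, and uses illustrative names.

```python
def link_capacity_mb_per_sec(gbps):
    """Theoretical link capacity in MB/s (1 Gbps = 1000 Mbit/s = 125 MB/s)."""
    return gbps * 1000 / 8


OSDS_PER_NODE = 8          # from the description above
BENCH_MB_PER_SEC = 110     # per-OSD bench figure, assumed to be megabytes/sec

node_disk_mb_per_sec = OSDS_PER_NODE * BENCH_MB_PER_SEC
link_1g = link_capacity_mb_per_sec(1)
link_10g = link_capacity_mb_per_sec(10)

print(f"Aggregate disk throughput per node: {node_disk_mb_per_sec} MB/s")
print(f"1 Gbps link:  {link_1g:.0f} MB/s ({node_disk_mb_per_sec / link_1g:.2f}x less than the disks)")
print(f"10 Gbps link: {link_10g:.0f} MB/s")
```

On these numbers a single HDD can already come close to filling a 1 Gbps link, so the network is worth ruling out before tuning anything else.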
The pool is defined as "replicated" with 3 copies (2 needed to keep running). The total amount of disk space is now 305 TB (72% used); reweight is in use because some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk writing load is around 4 Mbytes/sec average, with peaks up to 20 Mbytes/sec.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Here are some Ceph pointers that you could follow...
Get some good NVMe drives (one or two per server, but with 8 HDDs per server one should be enough) and use them for DB/WAL (make sure they have power-loss protection).
The ceph tell osd.* bench is not that relevant for real-world workloads; I suggest running some fio tests instead.
Set the OSD osd_memory_target to at least 8G of RAM.
To save some writes on your HDDs (data is not replicated X times), create your RBD pool as EC (erasure-coded pool), but please do some research on that because there are tradeoffs: recovery takes some extra CPU calculations.
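To put a number on the replicated vs. EC point, here is a small Python sketch of the storage overhead. The k=4, m=2 profile is only an illustrative assumption, not a recommendation for this cluster, and 305 TB is treated as raw capacity.

```python
def replicated_overhead(copies):
    """Bytes written to disk per byte of client data in a replicated pool."""
    return float(copies)


def ec_overhead(k, m):
    """Bytes written to disk per byte of client data in an EC pool (k data + m coding chunks)."""
    return (k + m) / k


RAW_TB = 305  # total disk space from the question, assumed to be raw capacity

rep3 = replicated_overhead(3)
ec42 = ec_overhead(4, 2)   # hypothetical profile chosen for illustration

print(f"replicated x3: overhead {rep3:.2f}, usable ~{RAW_TB / rep3:.1f} TB")
print(f"EC k=4 m=2:    overhead {ec42:.2f}, usable ~{RAW_TB / ec42:.1f} TB")
```

With this example profile each client write lands on disk 1.5 times instead of 3 times, which is where the saving on HDD writes comes from.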
All in all, hyper-converged clusters are good for training, small projects, and medium projects without a big workload on them... Keep in mind that planning is gold.
Just my 2 cents,
B.

How is cpu config for haproxy handled within docker?

I'm wondering about haproxy performance from within a container. To keep things simple: if I have a VM running haproxy with this CPU config, I know what to expect:
nbproc 1
nbthread 8
cpu-map auto:1/1-8 0-7
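To make the mapping explicit, this Python sketch expands the auto form of cpu-map into a thread-to-CPU table and checks it against the CPUs the current process is allowed to run on. The helper names are made up for illustration, and only the simple "auto:<proc>/<threads> <cpus>" form with plain ranges is handled.

```python
import os


def parse_range(spec):
    """Expand '0-7' or '3' into a list of ints."""
    if "-" in spec:
        low, high = spec.split("-", 1)
        return list(range(int(low), int(high) + 1))
    return [int(spec)]


def expand_auto_cpu_map(directive):
    """Expand 'cpu-map auto:<proc>/<threads> <cpus>' into {thread_id: cpu_id}."""
    keyword, target, cpus = directive.split()
    if keyword != "cpu-map" or not target.startswith("auto:"):
        raise ValueError("only the 'cpu-map auto:' form is handled in this sketch")
    _proc, threads = target[len("auto:"):].split("/", 1)
    thread_ids = parse_range(threads)
    cpu_ids = parse_range(cpus)
    if len(thread_ids) != len(cpu_ids):
        raise ValueError("thread set and CPU set must have the same size")
    return dict(zip(thread_ids, cpu_ids))


def cpus_outside_affinity(mapping, allowed_cpus):
    """Return the mapped CPUs that the process is not allowed to run on."""
    return sorted(set(mapping.values()) - set(allowed_cpus))


mapping = expand_auto_cpu_map("cpu-map auto:1/1-8 0-7")
print(mapping)

# Linux only: inside a container this reflects any cpuset restriction.
if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)
    print("mapped CPUs not available here:", cpus_outside_affinity(mapping, allowed))
```

Running the affinity check inside and outside the container is a quick way to see whether the container runtime has narrowed the set of CPUs the config expects.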
If I want to port the (whole) config to docker for testing purposes without any fancy swarm magic or setup just docker so that I can understand how things map, I'd imagine that the cpu config gets simpler and that the haproxy instance is meant to scale. I guess I have two questions:
Would you even bother configuring cpu from within an haproxy docker container or would you scale the container from behind a service? Maybe you need both.
Can a single container utilise the above config as though it were running on the system as a daemon? Would docker / containerd even care about this config?
I know having 4 containers, each with its own config and the CPUs evenly mapped like so, wouldn't scale or make any sense:
nbproc 1
nbthread 2
cpu-map auto:1/1-2 0-1
nbproc 1
nbthread 2
cpu-map auto:1/3-4 2-3
nbproc 1
nbthread 2
cpu-map auto:1/5-6 4-5
nbproc 1
nbthread 2
cpu-map auto:1/7-8 6-7
But it's this sort of saturation that I'm wondering about. Just how does haproxy / docker handle this sort of cpu nuance?
I've confirmed that there's little to no perceivable impact on the service when running haproxy under containerd vs. running it under systemd, using the image provided by haproxy. Running a single container with -d and --network host and no limits on CPU or memory, at worst I've seen a 2-3% impact on external web latency with live traffic peaking at about 50-60MB/sec, which itself depends on throughput and type of requests. On an 8-core VM with 4GB of memory (host CPU is a Xeon Gold 6130) and a gigabit interface, memory utilisation is almost identical. CPU performance also remains stable, with a potential 3-5% increase in utilisation. These tests are private and unpublished.
As far as CPU configuration goes:
nbproc 1
nbthread 8
cpu-map auto:1/1-8 0-7
master-worker
This config maps 1:1 between containerd and systemd and yields the results already mentioned. The process and threads start up under containerd and function as you expect. This takes up about 80-90% of the total CPU (800%), which represents less than 1 fully loaded core at peak. So in theory this container could be scaled a further 8 times with this configuration, or 5 or 6 times to leave some headroom.
Also note that any fluctuations in these performance figures are likely due to my environment. These measurements were taken from a real environment across multiple sites, not a test bed where I controlled every aspect. Also note that, depending on your host CPU and load, your results will vary wildly.

jmeter performance analysis

I am running a performance test against our perf environment.
Below are the results:
CPU Utilization
Server     Apdex       Resp. time  Throughput  Error Rate  CPU usage  Memory
per001205  0.97 [0.5]  220 ms      2,670 rpm   0.0009 %    493.00%    2.2 GB
per001206  0.95 [0.5]  280 ms      2,670 rpm   0.0043 %    516.00%    2.4 GB
per011079  0.83 [0.5]  526 ms      2,670 rpm   0.0034 %    598.00%    2.5 GB
per011080  0.67 [0.5]  1,110 ms    2,670 rpm   0.0026 %    639.00%    2.6 GB
Can you comment on the average response time? Is it acceptable?
I can see that CPU usage is more than 100%. Is that dangerous?
How should I improve this? I am running the test with 250 users.
First of all, check out the "CPU usage mismatch or usage over 100%" article.
Consider another monitoring method, e.g. go to the hosts directly and check CPU usage via your operating system's built-in commands, or use the JMeter PerfMon plugin, to either confirm the picture or get an alternative view of the CPU load. Depending on the result you have 2 options:
Either the individual servers' CPU usage is acceptable and you can decide whether the throughput is good or not
Or you need to fix the issue in your application code: use profiling tools for the programming language your application is written in to detect the most CPU-intensive functions, and refactor them to be less processor-time-hungry
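A CPU figure above 100% usually just means that more than one core is busy. As a rough sketch in Python, the readings from the table can be normalized by core count. The core count of 8 is purely an assumption here, since the question does not say how many cores the servers have.

```python
def cores_busy(cpu_percent):
    """Number of fully busy cores that a summed CPU percentage represents."""
    return cpu_percent / 100.0


def normalized_percent(cpu_percent, logical_cores):
    """CPU usage as a share of the whole machine (0-100%)."""
    return cpu_percent / logical_cores


ASSUMED_CORES = 8  # hypothetical; check the real value on each host

readings = {
    "per001205": 493.0,
    "per001206": 516.0,
    "per011079": 598.0,
    "per011080": 639.0,
}

for server, pct in readings.items():
    print(f"{server}: {cores_busy(pct):.2f} cores busy, "
          f"{normalized_percent(pct, ASSUMED_CORES):.1f}% of {ASSUMED_CORES} cores")
```

If the hosts really have 8 cores, the busiest server would be at about 80% of its capacity, which also lines up with it having the worst response time.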

Adobe Experience Manager (AEM), Java garbage collection tuning and memory management

I am currently using Adobe Experience Manager for a client's site (Java language). It uses OpenJDK:
#java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
It is running on Rackspace with the following:
vCPU: 4
Memory: 16GB
Guest OS: Red Hat Enterprise Linux 6 (64-bit)
Since it went into production I have been experiencing very slow performance from the application. It goes like this: I launch the app, everything is smooth, then 3 to 4 days later the CPU usage spikes to 400% (~4000 users/day hit the site). I got a few OOM exceptions (1 or 2), but mostly the site was exceptionally slow without ever reaching an OOM exception. Since I am a novice at Java memory management, I started reading about how it works and found tools like jstat. When the system was overwhelmed the second time around, I ran:
#top
Got the PID of the java process, then pressed shift+H and noted the PIDs of the threads with a high CPU percentage. Then I ran
#sudo -uaem jstack <PID>
Got a thread dump, converted the thread PIDs I had written down to hex, and searched for those hex values in the dump. After all that, I finally found that, not surprisingly, it was the garbage collector that was flipping out for some reason.
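The PID-to-hex lookup described above can be scripted. This is a minimal Python sketch: the thread ids and the two dump lines are made-up sample data, and it relies on thread dumps printing the native thread id as a lowercase hex "nid=0x..." field.

```python
import re


def tid_to_nid(tid):
    """Convert a decimal thread id from top into the hex form used in thread dumps."""
    return hex(tid)


def find_threads(dump_text, tids):
    """Return the thread dump header lines whose nid matches one of the given thread ids."""
    wanted = {tid_to_nid(tid) for tid in tids}
    hits = []
    for line in dump_text.splitlines():
        match = re.search(r"nid=(0x[0-9a-f]+)", line)
        if match and match.group(1) in wanted:
            hits.append(line)
    return hits


# Hypothetical sample lines, not taken from the real system.
SAMPLE_DUMP = """\
"main" prio=10 tid=0x00007f3c14009000 nid=0x6b1 waiting on condition [0x00007f3c1b9f2000]
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f3c1401e000 nid=0x6b2 runnable
"""

hot_thread_ids = [1714]  # decimal ids noted from top -H (example value)
for line in find_threads(SAMPLE_DUMP, hot_thread_ids):
    print(line)
```

If the matching lines are mostly GC task threads, as in this sample, the CPU is going into garbage collection rather than application code.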
I started reading a lot about Java GC tuning and came up with the following Java options.
I then restarted the application with them:
java
-Dcom.day.crx.persistence.tar.IndexMergeDelay=0
-Djackrabbit.maxQueuedEvents=1000000
-Djava.io.tmpdir=/srv/aem/tmp/
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/srv/aem/tmp/
-Xms8192m -Xmx8192m
-XX:PermSize=256m
-XX:MaxPermSize=1024m
-XX:+UseParallelGC
-XX:+UseParallelOldGC
-XX:ParallelGCThreads=4
-XX:NewRatio=1
-Djava.awt.headless=true
-server
-Dsling.run.modes=publish
-jar crx-quickstart/app/cq-quickstart-6.0.0-standalone.jar start
-c crx-quickstart -i launchpad -p 4503
-Dsling.properties=conf/sling.properties
It looks like it is performing much better, but I think it probably needs more GC tuning.
When I run:
#sudo -uaem jstat -gcutil <PID>
I get this:
S0    S1    E      O       P      YGC   YGCT     FGC  FGCT      GCT
0.00  0.00  55.97  100.00  45.09  4725  521.233  505  4179.584  4700.817
4 days after I restarted it.
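A few derived numbers make this output easier to read. The Python sketch below assumes the uptime is exactly 4 days, which is only approximate, and the function name is illustrative.

```python
def summarize_gc(ygc, ygct, fgc, fgct, uptime_seconds):
    """Derive average pause times and the overall share of time spent in GC."""
    return {
        "avg_young_pause_s": ygct / ygc,
        "avg_full_pause_s": fgct / fgc,
        "gc_time_fraction": (ygct + fgct) / uptime_seconds,
    }


# Values copied from the jstat output above; uptime assumed to be 4 days.
summary = summarize_gc(
    ygc=4725, ygct=521.233,
    fgc=505, fgct=4179.584,
    uptime_seconds=4 * 24 * 3600,
)

print(f"average young GC pause: {summary['avg_young_pause_s']:.3f} s")
print(f"average full GC pause:  {summary['avg_full_pause_s']:.3f} s")
print(f"time spent in GC:       {summary['gc_time_fraction'] * 100:.2f} %")
```

The young collections are short, but each full collection averages several seconds, and with the old generation at 100% those full collections are the part to look at first.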
When I run:
#sudo -uaem jstat -gccapacity <PID>
I get this:
NGCMN      NGCMX      NGC        S0C       S1C       EC
4194304.0  4194304.0  4194304.0  272896.0  279040.0  3636224.0
OGCMN      OGCMX      OGC        OC         PGCMN     PGCMX
4194304.0  4194304.0  4194304.0  4194304.0  262144.0  1048576.0
PGC       PC        YGC   FGC
262144.0  262144.0  4725  509
4 days after I restarted it.
These results are much better than when I started, but I think it can get even better. I'm not really sure what to do next as I'm no GC pro, so I was wondering if you have any tips or advice on how I could get better app/GC performance, and whether anything stands out as obviously wrong, like the ratios and sizes of the young and old generations?
How should I set the survivor and eden sizes/ratios?
Should I change the GC type, e.g. use CMS or G1?
How should I proceed?
Any advice would be helpful.
Best,
Nicola
The young-to-old ratio is commonly 1:3, but it can vary depending on how the application uses
short-lived and long-lived objects. If there are more short-lived objects, the
young space can be extended, for example to 2:3 (young:old). The reason for increasing the ratio is
to avoid scavenge garbage collection cycles. When many short-lived objects are allocated, the young space
fills up fast and triggers scavenge GC cycles, which in turn affect application performance. Raising the
ratio above its current value can reduce the number of scavenge GC cycles.
When the young space is increased, the survivor and eden spaces increase accordingly.
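As a rough illustration of how these ratios translate into sizes, here is a Python sketch based on the usual meaning of -XX:NewRatio (old/young) and -XX:SurvivorRatio (eden/one survivor space). The SurvivorRatio value of 8 is an assumption, and with the parallel collector's adaptive sizing the actual eden and survivor sizes can differ from this static calculation.

```python
def generation_sizes_mb(heap_mb, new_ratio):
    """Split the heap into (young, old) given NewRatio = old / young."""
    young = heap_mb / (new_ratio + 1)
    return young, heap_mb - young


def young_layout_mb(young_mb, survivor_ratio):
    """Split the young generation into (eden, one survivor space),
    given SurvivorRatio = eden / survivor and two survivor spaces."""
    survivor = young_mb / (survivor_ratio + 2)
    eden = young_mb - 2 * survivor
    return eden, survivor


HEAP_MB = 8192  # -Xms8192m -Xmx8192m from the question

for new_ratio in (1, 3):
    young, old = generation_sizes_mb(HEAP_MB, new_ratio)
    eden, survivor = young_layout_mb(young, survivor_ratio=8)  # assumed value
    print(f"NewRatio={new_ratio}: young={young:.0f} MB, old={old:.0f} MB, "
          f"eden={eden:.1f} MB, survivor={survivor:.1f} MB each")
```

NewRatio=1 gives the 4 GB young / 4 GB old split that the question's options request, while NewRatio=3 corresponds to the 1:3 ratio mentioned above.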
The CMS policy is used to reduce the application's pause times, while the G1 policy targets larger memories
with high throughput. The GC policy can be changed based on the needs of the application.
Recommended Use Cases for G1 :
The first focus of G1 is to provide a solution for users running applications that require large heaps with limited GC latency.
This means heap sizes of around 6GB or larger, and stable and predictable pause time below 0.5 seconds.
As you use an 8G heap, you can test the G1 GC policy in the same environment to check the GC performance.

Eight Socket Server (E7 v2) vs Two Socket Server (E5 v2) Performance

We have a single threaded application that is performing mainly numerical calculations.
We ran this application on the following machines:
(1) Dell 2 Socket (E5-2667 v2) server with 32 GB RAM (1833 MHz)
(2) IBM 8 Socket (E7-8891 v2) server with 32x8 GB RAM (1600 MHz)
The application is CPU-bound. Here is a comparison of the chips:
http://ark.intel.com/compare/75273,75259
We were shocked to see that the 8 socket server is about 6 times slower than the 2 socket server!
We are unsure whether the E5 is just much more optimized for floating point calculations (in a way that doesn't show up in the clock speed or cache). Or maybe it has to do with the way the 8 socket server accesses memory (more hops to access RAM). Or maybe it is something else. Can anyone shed some light on what's going on here?
Some more details:
When we did this performance comparison, the machine was only running this one single-threaded task. We were just testing to compare single-core performance between the two machines. We were running a compiled C++ program in a Linux environment. We expected both machines to perform similarly because the clock speed, cache size, and memory were all roughly similar between the E7 and E5 chips.

Resources