If I understand correctly, the Dapr docs say that a single sidecar uses 0.48 vCPU under the specified conditions.
Does this mean that if a huge app like Netflix or Uber (which use more than a thousand microservices) is going to use Dapr then it will need more than 480 vCPU? It's scary.
Where am I going wrong?
It is difficult to interpret or extrapolate a percentage of overall CPU from that test report. It says...
The Dapr sidecar uses 0.48 vCPU and 23 MB per 1000 requests per second.
Strictly speaking, they are just saying that at a throughput rate of 1000 requests/sec through the Dapr sidecar, 0.48 vCPU is consumed. The Dapr service generating that test load might have been idling on async/await statements waiting for responses, or it could have been consuming 8.0 vCPUs.
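To make that concrete: the benchmark figure scales with aggregate request rate, not with the number of services, so 1000 services only need 480 vCPU of sidecar overhead if every one of them sustains 1000 req/s. A back-of-envelope sketch (the 50 req/s per service below is an assumption, not a measurement):

```java
// Dapr's benchmark figure: 0.48 vCPU per 1000 req/s through one sidecar.
// Total sidecar CPU therefore scales with traffic, not service count.
public class SidecarCpuEstimate {
    static double sidecarVcpu(double requestsPerSecond) {
        return 0.48 * (requestsPerSecond / 1000.0);
    }

    public static void main(String[] args) {
        // 1000 services averaging an assumed 50 req/s each:
        double totalVcpu = 1000 * sidecarVcpu(50);
        System.out.printf("%.0f vCPU%n", totalVcpu); // 24 vCPU, not 480
    }
}
```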
This is a general question.
We have 15 provisioned containers running, and our Lambda makes calls to a VPC endpoint as well as to EFS. We've noticed that Lambda duration spikes sharply during deployments. It's also worth noting that we might be over-scaled in terms of the number of containers we need.
I understand there are AWS client connection timeouts (60 seconds) (https://code.amazon.com/packages/AWSJavaClientRuntime/blobs/mainline/--/main/com/amazonaws/ClientConfiguration.java#L111).
My general question is: Could the latency be coming from the containers having to reestablish connections to the VPC Endpoint / EFS?
We almost always see latency spikes for these service calls during deployments, and then they tend to smooth out. Interestingly, if we increase provisioned concurrency too high, we tend to see more spikes than usual. My theory is that most containers' connections to the VPC endpoint / EFS time out, and it's these timeouts that we observe as latency spikes when provisioned concurrency is too high.
Is this possible? And is it possible to have too much provisioned concurrency (i.e. do we need to hit a sweet spot for the best performance, such that the containers are always warm)?
Additional note:
We deploy to an alias in 10% / 3-minute increments. Deployments take 30 minutes, and we see these spikes for 30 or more minutes until they taper off. Spikes are more likely with higher provisioned concurrency, as mentioned above.
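The theory in the question can be sanity-checked with a toy model: for a fixed total request rate, spreading traffic across more provisioned containers lengthens the average idle gap each container's connections see, making a 60-second idle timeout more likely to fire between invocations. All numbers below are hypothetical:

```java
// Toy model of the idle-timeout theory (all numbers are assumptions):
// more containers at the same total load means longer gaps between
// requests on any one container, so its connections are more likely to
// sit idle past the timeout and be reestablished on the next call.
public class IdleTimeoutModel {
    // Average seconds between requests reaching one container,
    // assuming requests are spread evenly across the fleet.
    static double idleGapSeconds(double totalRps, int containers) {
        return containers / totalRps;
    }

    public static void main(String[] args) {
        double idleTimeout = 60.0; // seconds
        // At a hypothetical 1 req/s total: 15 containers stay warm, 150 do not.
        System.out.println(idleGapSeconds(1.0, 15) > idleTimeout);  // false
        System.out.println(idleGapSeconds(1.0, 150) > idleTimeout); // true
    }
}
```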
I am running a Kotlin Spring Boot based service in a Kubernetes cluster that connects to a PostgreSQL database. Each request makes around 3-5 database calls, which partially run in parallel via Kotlin coroutines (on a thread-pool-backed coroutine context).
No matter the configuration, this service gets throttled heavily when hit by real traffic just after starting up. This slowness sometimes persists for 2-3 minutes and often only affects some fresh pods, not all of them.
I am looking for new avenues to analyze the problem - here's a succinct list of circumstances / stuff I am already doing:
The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
New / autoscaled instances warm themselves up by issuing 15,000 HTTP requests against themselves. The readiness probe does not report "up" before this process finishes
We currently set a CPU request and limit of 2000m; changing this to 3000m reduces the issue, but latency still spikes to around 300-400 ms, which is not acceptable (at most 100 ms would be great, 50 ms ideal)
The memory is set to 2 GB; changing this to 3 GB has no significant impact
The pods allocate 200-300 MB/s during peak load; the GC activity does not seem abnormal to me
Switching between GCs (G1 and ZGC) has no impact
We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
New pods struggle to handle 200-300 requests/sec even though we warm them up; curiously, some pods suffer for long periods. All external factors have been analyzed, and disabling most baggage has no impact (this includes testing with tracing disabled, metric collection disabled, and the Kafka integration disabled, and verifying that our database load is not maxing out: it sits at around 20-30% CPU usage while network and memory usage are far lower)
The throttling is observed in custom load tests which replicate the warmup requests described above
Connecting with VisualVM during the load tests and checking where CPU time is spent yields no striking issues
This is all done on AWS-managed Kubernetes
All the nodes in our cluster are of the same type (AWS c5.2xlarge)
Any tools / avenues to investigate are appreciated, thank you! I am still puzzled why my service is getting throttled although its CPU usage is well below 100%. Our nodes are also not affected by the old kernel CFS bug from before kernel 5.6 (not entirely sure in which version it got fixed; we are on a very recent kernel version on our nodes though).
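One way the puzzle in that last paragraph can happen even without the old CFS bug: quotas are enforced per scheduling period, not on averages. A sketch with assumed numbers (not taken from this cluster) of how a burst of runnable threads can exhaust a 2000m quota early in each 100 ms period and be throttled for the rest of it:

```java
// Sketch of CFS quota math (assumed numbers): a 2000m limit grants the
// cgroup 200 ms of CPU per 100 ms scheduling period. When many threads
// are runnable at once, the quota is burned early in the period and the
// pod sits throttled, even though *average* CPU usage looks like ~50%.
public class CfsQuotaSketch {
    static double throttledFraction(double quotaMs, int runnableThreads, double periodMs) {
        double runMs = quotaMs / runnableThreads; // wall time until quota is exhausted
        if (runMs >= periodMs) return 0.0;        // quota outlasts the period: no throttling
        return (periodMs - runMs) / periodMs;     // fraction of each period spent throttled
    }

    public static void main(String[] args) {
        // limit 2000m -> 200 ms quota per 100 ms period, 8 runnable threads:
        System.out.println(throttledFraction(200, 8, 100)); // 0.75
    }
}
```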
In the end this all boiled down to missing one part of the equation: I/O bounds.
Imagine one request takes 10 DB calls, each taking 3 milliseconds to fulfill (including network latency etc.). A single request then takes 10 * 3 = 30 milliseconds of I/O. The throughput of one thread is then 1000 ms / 30 ms = 33.33 requests/second. Now if one service instance uses 10 threads to handle requests, we get 333.3 requests/second as our upper bound on throughput. We can't get any faster than this, because we are I/O-bottlenecked with respect to our thread count.
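The arithmetic above, as a tiny calculation with the same numbers:

```java
// I/O-bound throughput ceiling: sequential I/O per request caps the
// per-thread rate, and the thread count multiplies it.
public class IoBoundThroughput {
    static double maxRequestsPerSecond(int threads, int dbCallsPerRequest, double msPerDbCall) {
        double ioMsPerRequest = dbCallsPerRequest * msPerDbCall; // 10 * 3 = 30 ms
        double perThread = 1000.0 / ioMsPerRequest;              // 33.33 req/s per thread
        return threads * perThread;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f req/s%n", maxRequestsPerSecond(10, 10, 3.0)); // 333.3 req/s
    }
}
```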
And this leaves out multiple factors like:
thread pool size vs. db connection pool size
our service doing non-DB-related work (actual logic, JSON serialization when the response is produced)
database capacity (was not an issue for us)
TL;DR: You can't go faster when you are I/O-bottlenecked, no matter how much CPU you provide. I/O has to improve if you want a single service instance to have more throughput; this mostly comes down to sizing the DB connection pool relative to the thread pool, relative to the number of DB calls per request. We missed this basic (and well-known) relationship between resources!
Did anyone compare performance (latency, throughput, TPS) between an orderer with Kafka and a Raft orderer?
I see a considerable difference in terms of latency, throughput, and TPS.
I tried the same setup with the same resource configuration on two different VMs (the only difference is the ordering service).
Note: Used a single orderer in both networks. Fabric version: 1.4.4.
The orderer with Kafka is more efficient than Raft. I am using the default configuration for both Raft and Kafka.
I tried a load generator at a rate of 100 TPS. With Kafka all parameters are fine (latency 0.3 to 2 sec), whereas with Raft, latency gradually increases from 2 to 15+ seconds and the tx failure rate is also high.
What could be the reason for this considerable difference in TPS, throughput, and latency?
Please correct me if I am doing something wrong.
For starters, I would not run performance tests using a single orderer. These fault-tolerant systems exist to handle distribution and consensus in a distributed system, so by running a single orderer you are fundamentally removing the reason they exist. It's as if you are comparing two sports cars on a dirt road and wondering which is fastest.
Then there are other things that come into play, such as whether you connect the services over TLS, the general network latency, and how many brokers/nodes you are running.
Chris Ferris performed an initial performance analysis of the two systems prior to the release of Raft, and it seemed Raft was both faster and could handle almost twice as many transactions per second. You can read his blog post here: Does Hyperledger Fabric perform at scale?
You should also be aware of the double-spending problem and key collisions that can occur when you run a distributed system under high load, and take the necessary steps to avoid them, since they can cause a bottleneck. See this Medium post about collisions, and Hyperledger Fabric's own documentation on setting up a high-throughput network.
Which Amazon EC2 instance type should I choose for an application that only receives JSON, transforms it, saves it to a database, and returns JSON?
Java (Spring) + PostgreSQL.
Expected: 10k req/sec.
Your application is CPU-bound, so you should choose a compute-optimized instance; C4 is the latest generation of compute-optimized instances.
I had a similar application requirement, and with c4.xlarge I could get 40k requests/min on a single server within an SLA of 10 ms per request. You can also benchmark your application by running a stress test on different C4-generation instance types.
You should check out https://aws.amazon.com/ec2/instance-types/, AWS's doc on the different instance types and their use cases.
You can also check the CPU usage on your instance by looking at the CloudWatch metrics or by running the top command on your Linux instance.
Make sure that your instance is not running at more than 75% CPU utilization.
You can start with a smaller instance and gradually move to a larger server in the C4 category if you see CPU utilization becoming the bottleneck. This is how I found the right instance type for my application, keeping the SLA within 10 ms of server time.
P.S.: In my case the DB was also deployed on the same server, so throughput was lower; it will increase if you have the DB installed on a separate server.
Let me know if you need any other info.
Let's say that every request requires 20 ms of CPU processing time (thus not counting the waits between I/O operations); each core will then be able to process around 50 requests per second. To process 10k requests per second you will need 200 cores; this can be achieved, for example, with 16 VMs of 16 vCPUs each.
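The sizing arithmetic above as code; the 20 ms of CPU per request is this answer's assumption, not a measurement of the asker's workload:

```java
// Rough capacity estimate for a CPU-bound service: cores needed to
// sustain a target request rate given CPU time per request (assumed).
public class CapacityEstimate {
    static int coresNeeded(int targetRps, double cpuMsPerRequest) {
        double rpsPerCore = 1000.0 / cpuMsPerRequest; // 50 req/s per core at 20 ms
        return (int) Math.ceil(targetRps / rpsPerCore);
    }

    public static void main(String[] args) {
        System.out.println(coresNeeded(10_000, 20.0)); // 200
    }
}
```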
Having said that, you can then select the right instance for your needs using an EC2 instance selector tool. For instance:
these are all the instance types with 16 × 16 cores for less than $10k/year
if, otherwise, you're fine with "just" 64 cores in total, then take a look at these
If you have other constraints or if my assumptions weren't correct you can change the filters accordingly and choose the best type that suits your needs.
I'm trying to stress-test my Spring RESTful Web Service.
I run my Tomcat server on an Intel Core 2 Duo notebook with 4 GB of RAM. I know it's not a real server machine, but it's all I have, and it's only for study purposes.
For the test, I run JMeter on a remote machine, and communication goes through a private WLAN with a central wireless router. I prefer to test over a wireless connection because the service would be accessed from mobile clients. With JMeter I run a group of 50 threads, starting one thread per second, so after 50 seconds all threads are running. Each thread repeatedly sends an HTTP request to the server containing a small JSON object to be processed, sleeping on each iteration for a constant delay of 100 milliseconds plus a Gaussian-distributed random value with a standard deviation of 100 milliseconds. I use some JMeter plugins for the graphs.
Here are the results:
I can't figure out why my hits per second don't pass the 100 threshold (in the graph they are multiplied by 10), because with this configuration it should be higher than that (50 threads sending at least three times per second would generate 150 hits/sec). I don't get any error messages from the server, and everything seems to work well. I've tried more and more configurations, but I can't get more than 100 hits/sec.
Why?
[EDIT] Many times I notice a substantial performance degradation from some point on, without any visible cause: no error response messages on the client, only OK HTTP responses, and everything seems to work well on the server too. But looking at the reports:
As you can notice, something happens between 01:54 and 02:14: hits per second decrease and response time increases. Okay, it could be server overload, but then what about the CPU decreasing? That is not compatible with the congestion hypothesis.
I want to note that you've chosen the rows to display on the Composite Graph very well. It's enough to draw some conclusions:
Note that Hits per Second correlates perfectly with CPU usage. This means you have a CPU-bound system, and the maximum performance is mostly limited by CPU. This is very important to remember: server resources are consumed by hits, not by active users. You could disable your sleep timers entirely and would still see the same 80-90 hits/s.
The CPU tops out at around 80%, so I assume you are running Windows (Win7?) on your machine. In my experience it's hard to achieve 100% CPU utilization on a Windows machine; it just does not let you spend the last 20%. And if you've reached that maximum, then you are seeing your installation's capacity limit: it simply does not have enough CPU resources to serve more requests. To fight this bottleneck you should either provide more CPU (use another server with better CPU hardware), configure the OS to let you use up to 100% (I don't know if that is applicable), or optimize your system (code, OS settings) to spend less CPU serving a single request.
For the second graph, I'd suppose something was being downloaded through the router, or something was happening on the JMeter machine. "Something happens" means some task was running. This may be your friend who just wanted to run "grep error.log", or some scheduled task. To pin this down, you should look at the router's resources and the JMeter machine's resources during the degradation. There must be a process swallowing CPU, disk, or network.