Istio Envoy proxy cpu resources - performance

We would like to use Istio for our workload in a production environment. The required CPU resources are documented here: https://istio.io/latest/docs/ops/deployment/performance-and-scalability/
The documentation uses a load test scenario consisting of 1000 requests per second going through envoy proxies (among other parameters). We do have a much lower number of requests, and of course it is important to optimize CPU resources in order to optimize the cost of our solution.
So, the question is: can we reduce the CPU assignment to the envoy proxies proportionally to the reduction of parameters in our scenario? There is a minimum threshold for CPU assignment? I have found nothing about this in the documentation. Of course, we would like to keep a reasonable small latency added by the proxy (similar to the 2.65 ms in the documented scenario).
We can perform some load testing for our solution once we have Istio running; but the CPU additional resources consumed by Istio will be one of the relevant points in deciding whether to go with it or not. So some initial idea about this would be wonderful.

Related

Performance Difference bet RAFT Orderer and Orderer with Kafka(Latency, Throughput, TPS)

Did anyone compare performance(Latency, Throughput, TPS) between orderer with Kafka and RAFT Orderer?
I could see here a considerable difference in terms of latency, throughput, and TPS.
I tried with the same setup with the same resource configuration on two different VM(the Only difference is the orderer system).
Note: Used Single orderer in both networks.Fabric Version: 1.4.4
Orderer with Kafka is more efficient than RAFT. I am using the default configuration for RAFT and Kafka.
I tried with a load generator at a rate of 100 TPS. WIth Kafka all parameters are fine(latency- 0.3 to 2 sec) whereas using RAFT, latency is gradually increasing 2 to 15+ seconds, the tx failure rate is also high.
What could be the reason for this considerable difference in terms of TPS, throughput, and latency?
Please correct If I am doing something wrong.
For starters I would not run performance tests using a single orderer. These fault tolerance systems are there to handle distribution and consensus of a distributed system, so by running a single orderer you are fundamentally removing the reason they exist. It's as if you are comparing two sports cars on a dirt road and wonder which is the fastest.
Then there are other things that come into play, such as if you connect the services over TLS, the general network latency as well as how many brokers/nodes you are running.
Chris Ferris performed an initial performance analysis of the two systems prior to the release of Raft, and it seemed it was both faster and could handle almost twice as many transactions per second. You can read his blog post here: Does Hyperledger Fabric perform at scale?
You should also be aware of the double-spending problem and key collisions that can occur if you run a distributed system under high load. You should take necessary steps to avoid this, which can cause a bottle-neck. See this Medium post about collisions, and Hyperledger Fabric's own documentation on setting up a high throughput network.

Kubernetes number of replicas vs performance

I have just gotten into Kubernetes and really liking its ability to orchestrate containers. I had the assumption that when the app starts to grow, I can simply increase the replicas to handle the demand. However, now that I have run some benchmarking, the results confuse me.
I am running Laravel 6.2 w/ Apache on GKE with a single g1-small machine as the node. I'm only using NodePort service to expose the app since LoadBalancer seems expensive.
The benchmarking tool used are wrk and ab. When the replicas is increased to 2, requests/s somehow drops. I would expect the requests/s to increase since there are 2 pods available to serve the request. Is there a bottleneck occurring somewhere or perhaps my understanding is flawed. Do hope someone can point out what I'm missing.
A g1-small instance is really tiny: you get 50% utilization of a single core and 1.7 GB of RAM. You don't describe what your application does or how you've profiled it, but if it's CPU-bound, then adding more replicas of the process won't help you at all; you're still limited by the amount of CPU that GCP gives you. If you're hitting the memory limit of the instance that will dramatically reduce your performance, whether you swap or one of the replicas gets OOM-killed.
The other thing that can affect this benchmark is that, sometimes, for a limited time, you can be allowed to burst up to 100% CPU utilization. So if you got an instance and ran the first benchmark, it might have used a burst period and seen higher performance, but then re-running the second benchmark on the same instance might not get to do that.
In short, you can't just crank up the replica count on a Deployment and expect better performance. You need to identify where in the system the actual bottleneck is. Monitoring tools like Prometheus that can report high-level statistics on per-pod CPU utilization can help. In a typical database-backed Web application the database itself is the bottleneck, and there's nothing you can do about that at the Kubernetes level.

Best way to find out and set application resource limits/request on kubernetes

Hope you can help me with this!
What is the best approach to get and set request and limits resource per pods?
I was thinking in setting an expected number of traffic and code some load tests, then start a single pod with some "low limits" and run load test until OOMed, then tune again (something like overclocking) memory until finding a bottleneck, then attack CPU until everything is "stable" and so on. Then i would use that "limit" as a "request value" and would use double of "request values" as "limit" (or a safe value based on results). Finally scale them out for the average of traffic (fixed number of pods) and set autoscale pods rules for peak production values.
Is this a good approach? What tools and metrics do you recommend? I'm using prometheus-operator for monitoring and vegeta for load testing.
What about vertical pod autoscaling? have you used it? is it production ready?
BTW: I'm using AWS managed solution deployed w/ terraform module
Thanks for reading
I usually start my pods with no limits nor resources set. Then I leave them running for a bit under normal load to collect metrics on resource consumption.
I then set memory and CPU requests to +10% of the max consumption I got in the test period and limits to +25% of the requests.
This is just an example strategy, as there is no one size fits all approach for this.
The VerticalPodAutoScaler is more about making sure that a Pod can run. So it starts it low and doubles memory each time it gets OOMKilled. This can potentially lead to a Pod hogging resource. It is also limited as it doesn't take account of under-performance. If your app is under-resourced it might still respond but not respond in a timeframe you consider acceptable.
I think you are taking a good approach as you are looking at the application under load and assessing what it needs to perform as you want it to. I doubt I can suggest any tools you aren't already aware of but if it helps there is some more discussion in How to set the right cpu millicores for a container? and the threads that link from it

How can we determine how much web requests per second a machine can handle without load testing?

What are some of the things that determine how many web requests a single machine can handle? In general what's an average number (requests per second) that most machines should be able to handle? For example, I see some answers that say 2k requests/s can easily be handled. How about 5k? 10k? etc.
I'm basically trying to do my best at estimating how many machines I'd need to scale to some high throughput, before I dive into load testing.
Yes, That is possible through performance modelling but the answers will have 5-10% error margin.
If you know exact size of web request then probably you can find your nw limit and thus this gives you maximum possible request acceptance limit. some exploratory test with sample test you can get the cpu time required by each request roughly(in terms of response time or thoughput). thus you can extrapolate the results for higher number of requests using many theorems examples, little's law. Using this theorem you can find maximum no of users (request here) can be supported on a give hw for a give acceptable response time.
but this all is done after tuning your application to expected level otherwise you will end up with lot of hw because of lack of tuning.

Which are the most Relevant Performance Parameters / Measures for a Web Application

We are re-implementing(yes from scratch) a web application which is currently in production. We have decided to start doing some performance tests on the new app, to get some early information of the capabilities.
As the old application is currently in production and has a good performance we would like to extract some performance parameters, and then use this parameters as a reference or base goal of the performance of the new application.
Which do you think are the most relevant performance parameters we should be obtaining from the current production application?
Thanks!
Determine what pages are used the most.
Measure a latency histogram for the total time it takes to answer the request. Don't just measure the mean, measure a histogram.
From the histogram you can see how many % of requests have which latency in milliseconds. You can choose to key performance indicators by takes the values for 50% and 95%. This will tell you the average latency and the worst latency (for the worst 10% of requests).
Those two numbers alone will bring you great confidence regarding the experience your users will have.
Throughput does not matter for users, but for capacity planning.
I also recommend that you track the performance values over time and review them twice a year.
Just in case you need an HTTP client, there is weighttp, a multi-threaded client written by the guys from Lighttpd.
It has the same syntax used by ApacheBench, but weighttp lets you use several client worker threads (AB is single-threaded so it cannot saturate a modern SMP Web server).
The answer of "usr" is valid, but you can as well record the minnimum, average and maximum latencies (that's useful to see in which range they play). Here is a public-domain C program to automate all this on a given concurrency range.
Disclamer: I am involved in the development of this project.

Resources