Performance Difference bet RAFT Orderer and Orderer with Kafka(Latency, Throughput, TPS) - performance

Did anyone compare performance(Latency, Throughput, TPS) between orderer with Kafka and RAFT Orderer?
I could see here a considerable difference in terms of latency, throughput, and TPS.
I tried with the same setup with the same resource configuration on two different VM(the Only difference is the orderer system).
Note: Used Single orderer in both networks.Fabric Version: 1.4.4
Orderer with Kafka is more efficient than RAFT. I am using the default configuration for RAFT and Kafka.
I tried with a load generator at a rate of 100 TPS. WIth Kafka all parameters are fine(latency- 0.3 to 2 sec) whereas using RAFT, latency is gradually increasing 2 to 15+ seconds, the tx failure rate is also high.
What could be the reason for this considerable difference in terms of TPS, throughput, and latency?
Please correct If I am doing something wrong.

For starters I would not run performance tests using a single orderer. These fault tolerance systems are there to handle distribution and consensus of a distributed system, so by running a single orderer you are fundamentally removing the reason they exist. It's as if you are comparing two sports cars on a dirt road and wonder which is the fastest.
Then there are other things that come into play, such as if you connect the services over TLS, the general network latency as well as how many brokers/nodes you are running.
Chris Ferris performed an initial performance analysis of the two systems prior to the release of Raft, and it seemed it was both faster and could handle almost twice as many transactions per second. You can read his blog post here: Does Hyperledger Fabric perform at scale?
You should also be aware of the double-spending problem and key collisions that can occur if you run a distributed system under high load. You should take necessary steps to avoid this, which can cause a bottle-neck. See this Medium post about collisions, and Hyperledger Fabric's own documentation on setting up a high throughput network.

Related

Kubernetes throttling JVM application that isn't hitting CPU quota

I am running a Kotlin Spring Boot based service in a Kubernetes cluster that connects to a PostgreSQL database. Each request takes around 3-5 database calls which partially run in parallel via Kotlin coroutines (with a threadpool backed coroutine context present).
No matter the configuration this services gets throttled heavily after getting hit by real traffic after just starting up. This slowness sometimes persists for 2-3 minutes and often only affects some fresh pods, but not all.
I am looking for new avenues to analyze the problem - here's a succinct list of circumstances / stuff I am already doing:
The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
New / autoscaled instances warmup themselfes by doing 15000 HTTP requests against themselfs. The readiness probe is not "up" before this process finishes
We are currently setting a cpu request and limit of 2000m, changing this to 3000m does reduce the issue but the latency still spikes to around 300-400ms which is not acceptable (at most 100ms would be great, 50ms ideal)
The memory is set to 2gb, changing this to 3gb has no significant impact
The pods are allocating 200-300mb/s during peak load, the GC activity does not seem abnormal to me
Switching between GCs (G1 and ZGC) has no impact
We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
New pods struggle to take 200-300 requests / sec even though we warm up, curiously enough some pods suffer for long periods. All external factors have been analyzed and disabling most baggage has no impact (this includes testing with disabled tracing, metric collection, disabling Kafka integration and verifying our database load is not maxing out - it's sitting at around 20-30% CPU usage while network and memory usage are way lower)
The throttling is observed in custom load tests which replicates the warmup requests described above
Connecting with visualvm during the load tests and checking the CPU time spent yields no striking issues
This is all done on a managed kubernetes by AWS
All the nodes in our cluster are of the same type (c5.2xlarge of AWS)
Any tools / avenues to investigate are appreciated - thank you! I am still puzzled why my service is getting throttled although its CPU usage is way below 100%. Our nodes are also not affected by the old kernel cfs bug from before kernel 5.6 (not entirely sure in which version it got fixed, we are very recent on our nodes kernel version though).
In the end this all boiled down to missing one part of the equation: I/O bounds.
Imagine if one request takes 10 DB calls, each taking 3 milliseconds to fulfill (including network latency etc.). A single request then takes 10*3 = 30 milliseconds of I/O. The request throughput of one request is then 1000ms / 30ms = 33,33 requests / second. Now if one service instance uses 10 threads to handle requests we get 333,3 requests / seconds as our upper bound of throughput. We can't get any faster than this because we are I/O bottlenecked in regards to our thread count.
And this leaves out multiple factors like:
thread pool size vs. db connection pool size
our service doing non-db related tasks (actual logic, json serialization when the response get fulfilled)
database capacity (was not an issue for us)
TL;DR: You can't get faster when you are I/O bottlenecked, no matter much how CPU you provide. I/O has to be improve if you want your single service instance to have more throughput, this is mostly done by db connection pool sizing in relation to thread pool sizing in relation to db calls per request. We missed this basic (and well known) relation between resources!

Istio Envoy proxy cpu resources

We would like to use Istio for our workload in a production environment. The required CPU resources are documented here: https://istio.io/latest/docs/ops/deployment/performance-and-scalability/
The documentation uses a load test scenario consisting of 1000 requests per second going through envoy proxies (among other parameters). We do have a much lower number of requests, and of course it is important to optimize CPU resources in order to optimize the cost of our solution.
So, the question is: can we reduce the CPU assignment to the envoy proxies proportionally to the reduction of parameters in our scenario? There is a minimum threshold for CPU assignment? I have found nothing about this in the documentation. Of course, we would like to keep a reasonable small latency added by the proxy (similar to the 2.65 ms in the documented scenario).
We can perform some load testing for our solution once we have Istio running; but the CPU additional resources consumed by Istio will be one of the relevant points in deciding whether to go with it or not. So some initial idea about this would be wonderful.

Kubernetes number of replicas vs performance

I have just gotten into Kubernetes and really liking its ability to orchestrate containers. I had the assumption that when the app starts to grow, I can simply increase the replicas to handle the demand. However, now that I have run some benchmarking, the results confuse me.
I am running Laravel 6.2 w/ Apache on GKE with a single g1-small machine as the node. I'm only using NodePort service to expose the app since LoadBalancer seems expensive.
The benchmarking tool used are wrk and ab. When the replicas is increased to 2, requests/s somehow drops. I would expect the requests/s to increase since there are 2 pods available to serve the request. Is there a bottleneck occurring somewhere or perhaps my understanding is flawed. Do hope someone can point out what I'm missing.
A g1-small instance is really tiny: you get 50% utilization of a single core and 1.7 GB of RAM. You don't describe what your application does or how you've profiled it, but if it's CPU-bound, then adding more replicas of the process won't help you at all; you're still limited by the amount of CPU that GCP gives you. If you're hitting the memory limit of the instance that will dramatically reduce your performance, whether you swap or one of the replicas gets OOM-killed.
The other thing that can affect this benchmark is that, sometimes, for a limited time, you can be allowed to burst up to 100% CPU utilization. So if you got an instance and ran the first benchmark, it might have used a burst period and seen higher performance, but then re-running the second benchmark on the same instance might not get to do that.
In short, you can't just crank up the replica count on a Deployment and expect better performance. You need to identify where in the system the actual bottleneck is. Monitoring tools like Prometheus that can report high-level statistics on per-pod CPU utilization can help. In a typical database-backed Web application the database itself is the bottleneck, and there's nothing you can do about that at the Kubernetes level.

Apache Nifi slow cluster issue

I am using a Apache nifi for one of my clickstream projects to do some ETL.
I am getting traffic around 300 messages per second currently with the following infra:
RAM - 16 GB
Swap - 6 GB
CPU - 16 cores
Disk - 100GB (Persistance not required)
Cluster - 6 nodes
The entire cluster UI has become extremely slow with the following issues
Processors giving back pressure when some failure happens, which consumes lot of threads
Provenance writing becomes very slow
Heartbeat across nodes becomes slow
Cluster Heart beat
I have the following questions on the setup
Is RPG use recommended, as it is a HTTP call, which i using to spread
across all the nodes, as there is an existing issue with EMQTT
process for consumer group.
What is the recommended value of thread count that should be allotted
per core?
What are the guidelines for infrastructure sizing
What are the tuning parameters for a large cluster with high incoming requests and lot of heavy JSON parsing for transformation
A couple of suggestions
Yes RPG usage is recommended, at least from what I've experienced, RPG seems to offer better distribution. Take a look at [3] below
Some processors are CPU intensive then others so there's no clear cut answer for what value can be set for Concurrent Tasks. This is more of trial and error or testing and fine tuning approach that you'd have to master. One suggestion is, if you set too many Concurrent Tasks for a CPU intensive processor, it will have serious impact on the nodes.
Hortonworks have made a detailed guide regarding this. I've provided the link below. [1]
Some best practices and handy guides:
https://community.hortonworks.com/articles/7882/hdfnifi-best-practices-for-setting-up-a-high-perfo.html
http://ijokarumawak.github.io/nifi/2016/11/22/nifi-jolt/
https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/

Major discrepancy between Cassandra coordinator latency and client latency

When I measure our p99 read latency at the coordinator with cassandra.ClientRequest.ReadLatency.p99, I get a time of ~20ms. When I measure it from our client applications using the DataStax Java driver, I get a p99 of ~100ms. The raw round trip time (network overhead) between these machines is ~6ms. Is the remaining discrepancy typical? Or is there some problem to solve here? The only other likely culprit I can think of is garbage collection on the coordinator node.
Delays in network + kernel + driver deserialization + gcs is most likely cause with coordinated omission making them not tracked well. Also how you are measuring them is important, but the drivers metric is what is most likely metric that is interesting to you since thats the time your application sees. Most time outside the ClientRequest metric are things you have to resolve with environment. Although you might wanna make sure you dont have things in blocked state in the NativeTransport stage (tpstats) which will be held up before the request "start time" is marked.
Id recommend trying to use hdr histogram for monitoring as well, since if your using the Metrics timer its use of a sampling reservoir (what driver using by default) is very bad at tracking long tail latencies accurately.

Resources