Major discrepancy between Cassandra coordinator latency and client latency - performance

When I measure our p99 read latency at the coordinator with cassandra.ClientRequest.ReadLatency.p99, I get a time of ~20ms. When I measure it from our client applications using the DataStax Java driver, I get a p99 of ~100ms. The raw round trip time (network overhead) between these machines is ~6ms. Is the remaining discrepancy typical? Or is there some problem to solve here? The only other likely culprit I can think of is garbage collection on the coordinator node.

Delays in the network, the kernel, driver deserialization, and GCs are the most likely cause, with coordinated omission keeping them from being tracked well. How you measure also matters, but the driver's metric is most likely the one you care about, since that's the time your application sees. Most of the time spent outside the ClientRequest metric has to be resolved in your environment. You might also want to make sure nothing is sitting in a blocked state in the NativeTransport stage (see tpstats), since such requests are held up before the request "start time" is marked.
I'd recommend trying HdrHistogram for monitoring as well: if you're using the Metrics timer, its sampling reservoir (what the driver uses by default) is very bad at tracking long-tail latencies accurately.
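To illustrate why a sampling reservoir struggles with tail latencies, here is a minimal Python sketch. It is not the Metrics library's or HdrHistogram's actual implementation: the reservoir below is a plain uniform sample (Algorithm R), the size of 1028 is an assumed value, and the latencies are synthetic (5-25 ms with a 2000 ms stall on every 1000th request).

```python
import math
import random


def uniform_reservoir(values, size, rng):
    """Keep a uniform random sample of at most `size` values (Algorithm R)."""
    sample = []
    for i, value in enumerate(values):
        if i < size:
            sample.append(value)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < size:
                sample[j] = value
    return sample


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def synthetic_latencies(count, rng):
    """Hypothetical workload: mostly 5-25 ms, a 2000 ms stall every 1000th request."""
    return [2000.0 if i % 1000 == 0 else rng.uniform(5.0, 25.0) for i in range(count)]


rng = random.Random(42)
latencies = synthetic_latencies(100_000, rng)
reservoir = uniform_reservoir(latencies, 1028, rng)  # 1028 is an assumed size

# The exact value always sees the stalls; the sampled value depends on
# whether any of the rare stalls happened to land in the reservoir.
print("exact   p99.95:", percentile(latencies, 99.95))
print("sampled p99.95:", percentile(reservoir, 99.95))
```

With only about one stall expected among the 1028 retained samples, the sampled tail percentile can swing between a normal latency and the full stall from one run to the next. A histogram that records every value does not have that problem.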

Kubernetes throttling JVM application that isn't hitting CPU quota

I am running a Kotlin Spring Boot based service in a Kubernetes cluster that connects to a PostgreSQL database. Each request takes around 3-5 database calls which partially run in parallel via Kotlin coroutines (with a threadpool backed coroutine context present).
No matter the configuration, this service gets throttled heavily when it is hit by real traffic right after starting up. This slowness sometimes persists for 2-3 minutes and often only affects some fresh pods, but not all.
I am looking for new avenues to analyze the problem - here's a succinct list of circumstances / stuff I am already doing:
The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
New / autoscaled instances warm themselves up by doing 15000 HTTP requests against themselves. The readiness probe is not "up" before this process finishes
We are currently setting a cpu request and limit of 2000m, changing this to 3000m does reduce the issue but the latency still spikes to around 300-400ms which is not acceptable (at most 100ms would be great, 50ms ideal)
The memory is set to 2gb, changing this to 3gb has no significant impact
The pods are allocating 200-300mb/s during peak load, the GC activity does not seem abnormal to me
Switching between GCs (G1 and ZGC) has no impact
We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
New pods struggle to take 200-300 requests / sec even though we warm up, curiously enough some pods suffer for long periods. All external factors have been analyzed and disabling most baggage has no impact (this includes testing with disabled tracing, metric collection, disabling Kafka integration and verifying our database load is not maxing out - it's sitting at around 20-30% CPU usage while network and memory usage are way lower)
The throttling is observed in custom load tests which replicate the warmup requests described above
Connecting with VisualVM during the load tests and checking the CPU time spent yields no striking issues
This is all done on a managed Kubernetes by AWS
All the nodes in our cluster are of the same type (c5.2xlarge of AWS)
Any tools / avenues to investigate are appreciated - thank you! I am still puzzled why my service is getting throttled although its CPU usage is way below 100%. Our nodes are also not affected by the old kernel CFS bug from before kernel 5.6 (not entirely sure in which version it got fixed, but our nodes' kernel version is very recent).
In the end this all boiled down to missing one part of the equation: I/O bounds.
Imagine one request takes 10 DB calls, each taking 3 milliseconds to fulfill (including network latency etc.). A single request then takes 10 * 3 = 30 milliseconds of I/O. A single thread handling requests one after another can therefore serve at most 1000 ms / 30 ms = 33.33 requests / second. If one service instance uses 10 threads to handle requests, the upper bound on throughput is 333.3 requests / second. We can't get any faster than this because we are I/O bottlenecked with regard to our thread count.
And this leaves out multiple factors like:
thread pool size vs. db connection pool size
our service doing non-db related tasks (actual logic, JSON serialization when the response gets fulfilled)
database capacity (was not an issue for us)
TL;DR: You can't get faster when you are I/O bottlenecked, no matter how much CPU you provide. I/O has to be improved if you want your single service instance to have more throughput; this is mostly done by sizing the db connection pool in relation to the thread pool and to the number of db calls per request. We missed this basic (and well known) relation between resources!
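The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model under the same simplifying assumptions as the example (DB calls run one after another, each thread handles one request at a time, no CPU work in between); the function names are made up for illustration.

```python
import math


def max_requests_per_second(threads, db_calls_per_request, db_call_ms):
    """Upper bound on throughput when every thread just waits on sequential DB calls."""
    io_ms_per_request = db_calls_per_request * db_call_ms
    per_thread = 1000.0 / io_ms_per_request
    return threads * per_thread


def threads_needed(target_rps, db_calls_per_request, db_call_ms):
    """Minimum number of threads (and DB connections) to sustain target_rps."""
    io_ms_per_request = db_calls_per_request * db_call_ms
    return math.ceil(target_rps * io_ms_per_request / 1000.0)


# The numbers from the example: 10 threads, 10 calls of 3 ms each.
print(round(max_requests_per_second(10, 10, 3.0), 1))  # 333.3
# Sustaining 400 requests / second with the same calls needs at least 12 threads.
print(threads_needed(400, 10, 3.0))  # 12
```

The second function is the same relation turned around (Little's law): concurrency needed = arrival rate * time spent per request. If the connection pool is smaller than that number, extra threads only queue for connections.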

Performance Testing: What does a fluctuating response time indicate?

Below is the graph which I received after the performance test execution.
I am confused about the fluctuating response time graph.
NOTE: 1) Throughput graph is also fluctuating. 2) I did not receive any error during test.
It normally indicates that either the application under test or the JMeter engine is overloaded and hence cannot handle/produce a stable load pattern.
Your response time is around 1.5 minutes, which seems a little bit high to me, so I would suggest monitoring the application under test and checking:
whether it has enough headroom to operate in terms of CPU, RAM, network IO, etc.; it might be that the application is short on RAM and starts swapping, and disk IO is much slower than RAM. This can be checked using, for example, the JMeter PerfMon Plugin
whether it is properly configured for high loads, as its middleware (database, application server, load balancer, etc.) needs to be tuned; a spike-like response time pattern may indicate intensive GC activity
in any case, ensure that JMeter is also properly configured for high load and isn't short on resources: if JMeter isn't able to send/receive requests fast enough, you will get false-negative results
A single chart never tells the full story; you need to correlate information from all possible sources, collect log files, etc.

Performance Difference between RAFT Orderer and Orderer with Kafka (Latency, Throughput, TPS)

Did anyone compare performance (latency, throughput, TPS) between orderer with Kafka and RAFT orderer?
I could see here a considerable difference in terms of latency, throughput, and TPS.
I tried the same setup with the same resource configuration on two different VMs (the only difference is the orderer system).
Note: Used a single orderer in both networks. Fabric version: 1.4.4
Orderer with Kafka is more efficient than RAFT. I am using the default configuration for RAFT and Kafka.
I tried with a load generator at a rate of 100 TPS. With Kafka all parameters are fine (latency 0.3 to 2 sec), whereas with RAFT latency gradually increases from 2 to 15+ seconds, and the tx failure rate is also high.
What could be the reason for this considerable difference in terms of TPS, throughput, and latency?
Please correct me if I am doing something wrong.
For starters I would not run performance tests using a single orderer. These fault tolerance systems are there to handle distribution and consensus of a distributed system, so by running a single orderer you are fundamentally removing the reason they exist. It's as if you are comparing two sports cars on a dirt road and wonder which is the fastest.
Then there are other things that come into play, such as if you connect the services over TLS, the general network latency as well as how many brokers/nodes you are running.
Chris Ferris performed an initial performance analysis of the two systems prior to the release of Raft, and it seemed that Raft was both faster and able to handle almost twice as many transactions per second. You can read his blog post here: Does Hyperledger Fabric perform at scale?
You should also be aware of the double-spending problem and key collisions that can occur if you run a distributed system under high load. You should take necessary steps to avoid this, which can cause a bottle-neck. See this Medium post about collisions, and Hyperledger Fabric's own documentation on setting up a high throughput network.

Algorithms for establishing baselines from time series data

In my app I collect a lot of metrics: hardware/native system metrics (such as CPU load, available memory, swap memory, network IO in terms of packets and bytes sent/received, etc.) as well as JVM metrics (garbage collections, heap size, thread utilization, etc.) as well as app-level metrics (instrumentations that only have meaning to my app, e.g. # orders per minute, etc.).
Throughout the week, month, year I see trends/patterns in these metrics. For instance when cron jobs all kick off at midnight I see CPU and disk thrashing as reports are being generated, etc.
I'm looking for a way to assess/evaluate metrics as healthy/normal vs unhealthy/abnormal but that takes these patterns into consideration. For instance, if CPU spikes around (+/- 5 minutes) midnight each night, that should be considered "normal" and not set off alerts. But if CPU pins during a "low tide" in the day, say between 11:00 AM and noon, that should definitely cause some red flags to trigger.
I have the ability to store my metrics in a time-series database, if that helps kickstart this analytical process at all, but I don't have the foggiest clue as to what algorithms, methods and strategies I could leverage to establish these cyclical "baselines" that act as a function of time. Obviously, such a system would need to be pre-seeded or even trained with historical data that was mapped to normal/abnormal values (which is why I'm leaning towards a time-series DB as the underlying store) but this is new territory for me and I don't even know what to begin Googling so as to get back meaningful/relevant/educated solution candidates in the search results. Any ideas?
You could label each metric sample (CPU load, available memory, swap memory, network IO), together with its day and time, as good or bad.
Come up with a set of data for a given time frame with metric values and whether they are good or bad. Train a model using 70% of the data with the good and bad answers in the data.
Then test the trained model using the other 30% of data without the answers to see if you get the predicted results (good,bad) from the model. You could use a classification algorithm.
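As a rough illustration of that train/test workflow, here is a minimal Python sketch. Everything in it is an assumption made for the example: the data is synthetic and hand-labelled, the "model" is a deliberately simple per-hour ceiling rather than a real classification algorithm, and the split is by time (first 70% of days for training, last 30% for testing) instead of a random shuffle.

```python
def make_samples(days=30):
    """Synthetic, hand-labelled samples: (day, hour, cpu_percent, label)."""
    samples = []
    for day in range(days):
        for hour in range(24):
            if hour == 0:
                cpu, label = 90 + day % 5, "good"  # nightly cron spike, expected
            elif hour == 11 and day % 10 == 3:
                cpu, label = 95, "bad"  # CPU pinned during a quiet hour
            else:
                cpu, label = 20 + day % 7 + hour % 3, "good"
            samples.append((day, hour, cpu, label))
    return samples


def train(samples):
    """Learn, per hour of day, the highest CPU value that was labelled good."""
    ceiling = {}
    for _day, hour, cpu, label in samples:
        if label == "good":
            ceiling[hour] = max(ceiling.get(hour, 0), cpu)
    return ceiling


def predict(model, hour, cpu):
    """Flag a value as bad when it exceeds what was normal for that hour."""
    return "bad" if cpu > model.get(hour, 0) else "good"


def accuracy(model, samples):
    hits = sum(predict(model, hour, cpu) == label for _day, hour, cpu, label in samples)
    return hits / len(samples)


days = 30
samples = make_samples(days)
split_day = days * 7 // 10  # first 70% of the days are used for training
train_set = [s for s in samples if s[0] < split_day]
test_set = [s for s in samples if s[0] >= split_day]

model = train(train_set)
print("test accuracy:", accuracy(model, test_set))
print("93% CPU at midnight:", predict(model, 0, 93))
print("93% CPU at 11:00:", predict(model, 11, 93))
```

The same CPU value is judged differently depending on the hour, which is the behaviour the question asks for. Real metrics are much noisier than this toy data, so the accuracy on a held-out set is the number to watch when swapping in a proper classifier.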

Socket.io: How to reduce emit delay with many concurrent connections?

I'm running a 4-core Amazon EC2 instance (m3.xlarge) with 200,000 concurrent connections and no resource problems (each core at 10-20%, memory at 2/14 GB). However, if I emit a message to all users, the user who connected first on a CPU core gets it within milliseconds, but the user who connected last gets it with a delay of 1-3 seconds, and each CPU core goes up to 100% for 1-2 seconds. I noticed this problem even at "only" 50k concurrent users (12.5k per core).
How to reduce the delay?
I tried changing redis-adapter to mongo-adapter with no difference.
I'm using this code to get sticky sessions on multiple CPU cores:
https://github.com/elad/node-cluster-socket.io
The test was very simple: the clients just connect and do nothing more. The server only listens for a message and emits to all.
EDIT: I tested single-core without any cluster/adapter logic with 50k clients and the same result.
I published the server, single-core-server, benchmark and html-client in one package: https://github.com/MickL/socket-io-benchmark-kit
OK, let's break this down a bit. 200,000 users on four cores. If perfectly distributed, that's 50,000 users per core. So, if sending a message to a given user takes .1ms each of CPU time, that would take 50,000 * .1ms = 5 seconds to send them all.
If you see CPU utilization go to 100% during this, then a bottleneck probably is CPU and maybe you need more cores on the problem. But, there may be other bottlenecks too such as network bandwidth, network adapters or the redis process. So, one thing to immediately determine is whether your end-to-end time is directly proportional to the number of clusters/CPUs you have? If you drop to 2 cores, does the end-to-end time double? If you go to 8, does it drop in half? If yes for both, that's good news because that means you probably are only running into CPU bottleneck at the moment, not other bottlenecks. If that's the case, then you need to figure out how to make 200,000 emits across multiple clusters more efficient by examining node-cluster-socket.io code and finding ways to optimize your specific situation.
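A small Python sketch of that estimate and of the scaling check follows. It assumes a purely CPU-bound broadcast with users spread evenly across cores; the 0.1 ms per send is the illustrative figure from above, not a measurement, and the function name is made up.

```python
def broadcast_seconds(total_users, cores, cpu_ms_per_send):
    """Time for every core to send to its share of users, if CPU is the only limit."""
    users_per_core = total_users / cores
    return users_per_core * cpu_ms_per_send / 1000.0


# 200,000 users at 0.1 ms of CPU per send, on 2, 4 and 8 cores.
for cores in (2, 4, 8):
    print(cores, "cores:", round(broadcast_seconds(200_000, cores, 0.1), 2), "s")
```

If measured end-to-end times follow this pattern (roughly doubling at 2 cores and halving at 8), CPU is the likely bottleneck. If they stay flat when cores are added, something else (network, the adapter, redis) is limiting the broadcast.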
The most efficient the code could be would be to have every CPU do all its housekeeping to gather exactly what it needs to send to all 50,000 users, and then have each CPU very quickly run a tight loop sending 50,000 network packets one right after the other. I can't really tell from the redis adapter code whether this is what happens or not.
A much worse case would be where some process gets all 200,000 socket IDs and then goes in a loop to send to each socket ID, where in that loop it has to look up in redis which server contains that connection and then send a message to that server telling it to send to that socket. That would be a ton less efficient than instructing each server to just send a message to all its own connected users.
It would be worth trying to figure out (by studying code) where in this spectrum, the socket.io + redis combination is.
Oh, and if you're using an SSL connection for each socket, you are also devoting some CPU to crypto on every send operation. There are ways to offload the SSL processing from your regular CPU (using additional hardware).

Resources