I have a simple HTTP server written in Go that accepts big chunks of data (up to 5MB per single request in some cases, but it can also be just tens of KB, depending on the usage pattern). The server then processes the received data asynchronously by adding it to a buffer from which one of the workers (goroutines) picks up a task. This server runs as a container in Kubernetes and has a memory limit set. Unfortunately, I'm also not allowed to use HPA, as only one pod is allowed per client.
The problem occurs when someone sends a lot of big chunks of data to my server: because of the memory limit, the kubelet kills my container, and as a result all data stored in the buffer is lost.
I have tried the following ways to mitigate the problem:
Removing the memory limit in the pod spec. Unfortunately my server runs in a multi-tenant environment and I'm forced to set a memory limit.
Limiting the number of requests processed in flight by adding a buffered channel and timing out when a request can't be added to it within 10 seconds. This partially mitigated the problem, but first, it's quite tricky to find a good balance between the buffer size and the timeout; second, if a client sends a lot of small requests, the server drops some of them even though it has plenty of free memory.
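The admission part looks roughly like this (simplified sketch; the handler path, buffer size, and status codes are made up, and the workers draining the channel are omitted):

    package main

    import (
        "io"
        "log"
        "net/http"
        "time"
    )

    // tasks is the buffer the worker goroutines consume from (size is made up).
    var tasks = make(chan []byte, 100)

    // handle reads the payload and tries to enqueue it, giving up after 10 seconds.
    func handle(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, "bad request", http.StatusBadRequest)
            return
        }
        select {
        case tasks <- body:
            w.WriteHeader(http.StatusAccepted)
        case <-time.After(10 * time.Second):
            // The buffer stayed full for 10 seconds: drop the request.
            http.Error(w, "queue full", http.StatusServiceUnavailable)
        }
    }

    func main() {
        http.HandleFunc("/ingest", handle)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }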
I have found that I can get the current memory usage of my binary by calling runtime.ReadMemStats. So my next idea is to drop requests if, for example, memory usage goes above some threshold (80%). Is that the only way to resolve the problem?
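In code the idea would look roughly like this (sketch only - the limit constant and the 80% threshold are placeholders, and I'm not sure whether Sys is the right MemStats field to compare against):

    package main

    import (
        "log"
        "net/http"
        "runtime"
    )

    // memoryLimitBytes would come from configuration (e.g. the container limit
    // minus some headroom); the value here is a placeholder.
    const memoryLimitBytes = 512 << 20 // 512 MiB

    // overMemoryThreshold reports whether the runtime currently holds more than
    // 80% of the configured limit. runtime.ReadMemStats briefly stops the world,
    // so real code should not call it on every single request.
    func overMemoryThreshold() bool {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        return m.Sys > uint64(memoryLimitBytes)*80/100
    }

    // guarded rejects requests while memory usage is above the threshold.
    func guarded(next http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            if overMemoryThreshold() {
                http.Error(w, "over memory budget", http.StatusServiceUnavailable)
                return
            }
            next(w, r)
        }
    }

    func main() {
        http.HandleFunc("/ingest", guarded(func(w http.ResponseWriter, r *http.Request) {
            w.WriteHeader(http.StatusAccepted) // placeholder for the real handler
        }))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }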
I am running a Kotlin Spring Boot based service in a Kubernetes cluster that connects to a PostgreSQL database. Each request makes around 3-5 database calls, which partially run in parallel via Kotlin coroutines (with a threadpool-backed coroutine context).
No matter the configuration, this service gets throttled heavily when hit by real traffic shortly after starting up. The slowness sometimes persists for 2-3 minutes and often affects only some fresh pods, not all of them.
I am looking for new avenues to analyze the problem - here's a succinct list of circumstances / stuff I am already doing:
The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
New / autoscaled instances warm themselves up by doing 15,000 HTTP requests against themselves. The readiness probe does not report "up" before this process finishes
We are currently setting a CPU request and limit of 2000m; changing this to 3000m does reduce the issue, but latency still spikes to around 300-400ms, which is not acceptable (at most 100ms would be great, 50ms ideal)
The memory is set to 2GB; changing this to 3GB has no significant impact
The pods are allocating 200-300MB/s during peak load; the GC activity does not seem abnormal to me
Switching between GCs (G1 and ZGC) has no impact
We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
New pods struggle to handle 200-300 requests / sec even though we warm up; curiously enough, some pods suffer for long periods. All external factors have been analyzed and disabling most baggage has no impact (this includes testing with tracing disabled, metric collection disabled, the Kafka integration disabled, and verifying that our database load is not maxing out - it sits at around 20-30% CPU usage while network and memory usage are far lower)
The throttling is observed in custom load tests which replicate the warmup requests described above
Connecting with VisualVM during the load tests and checking the CPU time spent yields no striking issues
This is all done on managed Kubernetes by AWS
All the nodes in our cluster are of the same type (AWS c5.2xlarge)
Any tools / avenues to investigate are appreciated - thank you! I am still puzzled why my service is getting throttled although its CPU usage is way below 100%. Our nodes are also not affected by the old kernel CFS bug from before kernel 5.6 (not entirely sure in which version it got fixed; our nodes' kernel version is very recent though).
In the end this all boiled down to missing one part of the equation: I/O bounds.
Imagine one request requires 10 DB calls, each taking 3 milliseconds to fulfill (including network latency etc.). A single request then spends 10 * 3 = 30 milliseconds on I/O, so one request-handling thread can serve at most 1000ms / 30ms = 33.33 requests / second. Now, if one service instance uses 10 threads to handle requests, we get 333.3 requests / second as our upper bound on throughput. We can't get any faster than this because we are I/O bottlenecked with respect to our thread count.
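Put as a back-of-the-envelope calculation (a sketch using the made-up numbers from the example above):

    package main

    import "fmt"

    func main() {
        const (
            dbCallsPerRequest = 10  // DB calls a single request needs
            dbCallMillis      = 3.0 // average latency per DB call, in ms
            requestThreads    = 10  // threads available to handle requests
        )
        ioPerRequest := dbCallsPerRequest * dbCallMillis // 30 ms of I/O per request
        perThread := 1000.0 / ioPerRequest               // ~33.3 requests/s per thread
        // Upper bound on throughput, no matter how much CPU is available: ~333.3 req/s
        fmt.Printf("throughput upper bound: %.1f req/s\n", perThread*requestThreads)
    }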
And this leaves out multiple factors like:
thread pool size vs. db connection pool size
our service doing non-DB-related tasks (actual logic, JSON serialization when the response is built)
database capacity (was not an issue for us)
TL;DR: You can't get faster when you are I/O bottlenecked, no matter how much CPU you provide. I/O has to improve if you want your single service instance to have more throughput; this is mostly a matter of sizing the DB connection pool in relation to the thread pool in relation to the number of DB calls per request. We missed this basic (and well known) relationship between resources!
We have a cluster of workers that send indexing requests to a 4-node Elasticsearch cluster. The documents are indexed as they are generated, and since the workers have a high degree of concurrency, Elasticsearch is having trouble handling all the requests. To give some numbers, the workers process up to 3,200 tasks at the same time, and each task usually generates about 13 indexing requests. This generates an instantaneous rate that is between 60 and 250 indexing requests per second.
From the start, Elasticsearch had problems and requests were timing out or returning 429. To get around this, we increased the timeout on our workers to 200 seconds and increased the write thread pool queue size on our nodes to 700.
That's not a satisfactory long-term solution though, and I was looking for alternatives. I have noticed that when I copied an index within the same cluster with elasticdump, the write thread pool was almost empty and I attributed that to the fact that elasticdump batches indexing requests and (probably) uses the bulk API to communicate with Elasticsearch.
That gave me the idea that I could write a buffer that receives requests from the workers, batches them in groups of 200-300 requests, and then sends each group to Elasticsearch as a single bulk request.
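Roughly what I have in mind, sketched in Go (the Elasticsearch URL, index name, batch size, and flush interval are all made up, and it talks to the plain _bulk HTTP endpoint rather than any particular client library):

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
        "time"
    )

    // docs receives raw JSON documents from the workers.
    var docs = make(chan []byte, 1000)

    // batcher groups documents and sends each group as one _bulk request.
    func batcher(esURL, index string) {
        const maxBatch = 250
        ticker := time.NewTicker(2 * time.Second)
        defer ticker.Stop()

        batch := make([][]byte, 0, maxBatch)
        flush := func() {
            if len(batch) == 0 {
                return
            }
            var body bytes.Buffer
            for _, doc := range batch {
                fmt.Fprintf(&body, "{\"index\":{\"_index\":%q}}\n", index)
                body.Write(doc)
                body.WriteByte('\n')
            }
            // Real code would check the response for per-item errors and retry.
            resp, err := http.Post(esURL+"/_bulk", "application/x-ndjson", &body)
            if err != nil {
                log.Printf("bulk request failed: %v", err)
            } else {
                resp.Body.Close()
            }
            batch = batch[:0]
        }

        for {
            select {
            case doc := <-docs:
                batch = append(batch, doc)
                if len(batch) >= maxBatch {
                    flush()
                }
            case <-ticker.C:
                flush()
            }
        }
    }

    func main() {
        // Workers (not shown) would write their documents into docs.
        go func() { docs <- []byte(`{"field":"value"}`) }()
        batcher("http://localhost:9200", "my-index")
    }

The workers would only ever see the channel, and Elasticsearch would only ever see bulk requests of a few hundred documents each.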
Does such a thing already exist, and does it sound like a good idea?
First of all, it's important to understand what happens behind the scenes when you send an index request to Elasticsearch, in order to troubleshoot the issue and find the root cause.
Elasticsearch has several thread pools, but indexing requests (single and bulk) go through the write thread pool. Check this against your Elasticsearch version, as Elastic keeps changing the thread pools (earlier there were separate thread pools for single and bulk requests, with different queue capacities).
In the latest ES version (7.10) the write thread pool's queue capacity was increased significantly, from 200 in earlier releases to 10,000, presumably for the following reasons:
Elasticsearch now prefers to buffer more indexing requests instead of rejecting them.
Although a larger queue means more latency, it's a trade-off, and it reduces data loss if the client has no retry mechanism.
I am sure you have not yet moved to ES 7.9, where the capacity was increased, but you can increase the size of this queue gradually and allocate more processors (if you have spare capacity) through the config change mentioned in this official example. This is a very debatable topic and a lot of people consider it a band-aid rather than a proper fix, but now that Elastic themselves have increased the queue size you can try it too, and if your increased traffic only comes in short bursts it makes even more sense.
Another critical thing is to find the root cause of why your ES nodes are queuing up so many requests. It can be legitimate, e.g. growing indexing traffic that has pushed the infrastructure to its limit, but if it isn't, have a look at my short tips for improving one-time and overall indexing performance; implementing them will give you a better indexing rate and reduce the pressure on the write thread pool queue.
Edit: As mentioned by @Val in the comments, if you are indexing docs one by one, then moving to the bulk index API will give you the biggest boost.
I'm running a 4-core Amazon EC2 instance (m3.xlarge) with 200,000 concurrent connections and no resource problems (each core at 10-20%, memory at 2/14GB). However, if I emit a message to all connected users, the user who connected first on a CPU core gets it within milliseconds, but the last connected user gets it with a delay of 1-3 seconds, and each CPU core goes up to 100% for 1-2 seconds. I noticed this problem even at "only" 50k concurrent users (12.5k per core).
How to reduce the delay?
I tried changing redis-adapter to mongo-adapter with no difference.
I'm using this code to get sticky sessions on multiple CPU cores:
https://github.com/elad/node-cluster-socket.io
The test was very simple: The clients do just connect and do nothing more. The server only listens for a message and emits to all.
EDIT: I tested single-core without any cluster/adapter logic with 50k clients and the same result.
I published the server, single-core-server, benchmark and html-client in one package: https://github.com/MickL/socket-io-benchmark-kit
OK, let's break this down a bit. 200,000 users on four cores. If perfectly distributed, that's 50,000 users per core. So, if sending a message to a given user takes .1ms each of CPU time, that would take 50,000 * .1ms = 5 seconds to send them all.
If you see CPU utilization go to 100% during this, then the bottleneck probably is CPU and maybe you need more cores on the problem. But there may be other bottlenecks too, such as network bandwidth, network adapters, or the redis process. So, one thing to determine immediately is whether your end-to-end time is directly proportional to the number of clusters/CPUs you have. If you drop to 2 cores, does the end-to-end time double? If you go to 8, does it drop in half? If yes for both, that's good news, because it means you probably are only running into a CPU bottleneck at the moment, not other bottlenecks. If that's the case, then you need to figure out how to make 200,000 emits across multiple clusters more efficient by examining the node-cluster-socket.io code and finding ways to optimize your specific situation.
The most optimal the code could be is for every CPU to do all its housekeeping to gather exactly what it needs to send to its 50,000 users, and then for each CPU to run a tight loop sending 50,000 network packets one right after the other. I can't really tell from the redis adapter code whether this is what happens or not.
A much worse case would be where some process gets all 200,000 socket IDs and then goes in a loop to send to each socket ID, where in that loop it has to look up in redis which server holds that connection and then send a message to that server telling it to send to that socket. That would be far less efficient than instructing each server to just send a message to all its own connected users.
It would be worth trying to figure out (by studying the code) where on this spectrum the socket.io + redis combination sits.
Oh, and if you're using an SSL connection for each socket, you are also devoting some CPU to crypto on every send operation. There are ways to offload the SSL processing from your regular CPU (using additional hardware).
I'm trying to run RabbitMQ on a small VPS (512mb RAM) along with Nginx and a few other programs. I've been able to tweak the memory usage of everything else without difficulty, but I can't seem to get RabbitMQ to use any less RAM.
I think I need to reduce the number of threads Erlang uses for RabbitMQ, but I've not been able to get it to work. I've also tried setting the vm_memory_high_watermark to a few different values below the default (of 40%), even as low as 5%.
Part of the problem might be that the VPS provider (MediaTemple) allows me to go over my allocated memory, so when using free or top, it shows that the server has around 900mb.
Any suggestions to reduce memory usage by RabbitMQ, or limit the number of threads that Erlang will create? I believe Erlang is using 30 threads, based on the -A30 flag that I've seen on the process command.
Ideally I'd like RabbitMQ mem usage to be below 100mb.
Edit:
With vm_memory_high_watermark set to 5% (or 0.05 in the config file), the RabbitMQ logs report that RabbitMQ's memory limit is set to 51mb. I'm not sure where that 51mb comes from. The VPS's currently allocated memory is 924mb, so 5% of that should be around 46mb.
According to htop/free, before starting RabbitMQ I'm sitting at around 453mb of used RAM, and after starting RabbitMQ I'm at around 650mb - nearly a 200mb increase. Could it be that 200mb is the lower limit that RabbitMQ will run with?
Edit 2
Here are some screenshots of ps aux and free before and after starting RabbitMQ and a graph showing the memory spike when RabbitMQ is started.
Edit 3
I also checked with no plugins enabled, and it made very little difference. It seems the plugins I had (management and its prerequisites) only added about 8mb of ram usage.
Edit 4
I no longer have this server to test with, however, there is a conf setting delegate_count that is set to a default of 16. As far as I know, this spawns 16 sup-procs for rabbitmq. Lowering this number on smaller servers may help reduce the memory footprint. No idea if this actually works, or how it impacts performance, but it's something to try.
The appropriate way to limit memory usage in RabbitMQ is using the vm_memory_high_watermark. You said:
I've also tried setting the vm_memory_high_watermark to a few different values below the default (of 40%), even as low as 5%.
This should work, but it might not be behaving the way you expect. In the logs, you'll find a line that tells you what the absolute memory limit is, something like this:
=INFO REPORT==== 29-Oct-2009::15:43:27 ===
Memory limit set to 2048MB.
You need to tweak the memory limit as needed - Rabbit might be seeing your system as having a lot more RAM than you think it has if you're running on a VPS environment.
Sometimes, Rabbit can't tell what system you're on and uses 1GB as the base point (so you get a limit of 410MB by default).
Also, make sure you are running on a version of RabbitMQ that supports the vm_memory_high_watermark setting - ideally you should run with the latest stable release.
Make sure to set an appropriate QoS prefetch value. By default, if there's a client, the Rabbit server will send any messages it has for that client's queue to the client. This results in extensive memory usage both on the client & the server.
Drop the prefetch limit down to something reasonable, like say 100, and Rabbit will keep the remaining messages on disk on the server until the client is really ready to process them, and your memory usage will go way way down on both the client & the server.
Note that the suggestion of 100 is just a reasonable place to start - it sure beats infinity. To really optimize that number, you'll want to take into consideration the messages/sec your client is able to process, the latency of your network, and also how large each of your messages is on average.
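For illustration, here is roughly what that looks like from a Go consumer using the rabbitmq/amqp091-go client (the URL and queue name are placeholders; the same idea applies in whatever client library you actually use):

    package main

    import (
        "log"

        amqp "github.com/rabbitmq/amqp091-go"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }
        defer ch.Close()

        // Prefetch of 100: the broker keeps at most 100 unacknowledged messages
        // in flight to this consumer instead of pushing the whole queue at once.
        if err := ch.Qos(100, 0, false); err != nil {
            log.Fatal(err)
        }

        // autoAck must be false, otherwise the prefetch limit has no effect.
        msgs, err := ch.Consume("work", "", false, false, false, false, nil)
        if err != nil {
            log.Fatal(err)
        }
        for m := range msgs {
            // ... process m.Body ...
            m.Ack(false)
        }
    }

Tuning that 100 up or down is exactly the trade-off described above: throughput versus how much sits in memory on both ends.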
I was trying to put some heavy load on my Redis for testing purposes and find out any upper limits. First I loaded it with 50,000 and then 100,000 keys of 32 characters, with values of around 32 characters. That took no more than 8-15 seconds for both key counts. Now I'm trying to put 4KB of data as the value for each key. The first 10,000 keys take 800 milliseconds to set, but from that point it slows down gradually, and setting all 50,000 keys takes around 40 minutes. I am loading the database using Node.js with node_redis (Mranney). Am I making a mistake, or is Redis just that slow with big 4KB values?
One more thing I just found: when I run another client in parallel to the current one and update keys, this second client finishes loading the 50,000 keys with 4KB values within 8 seconds, while the first client still grinds on forever. Is it a bug in Node or the redis library? This is alarming and not acceptable for production.
You'll need to get some kind of back pressure for doing bulk writes from node into Redis. By default, node will queue all writes and does not enforce an upper bound on the outgoing queue size.
node_redis has a "drain" event that you can listen for to implement some rudimentary back pressure.
The default redis configuration is not optimized for that sort of usage. I suspect you have it swapping to disk with a page size of 32 bytes, which means that each key added has to find 128 contiguous free pages and may end up using system VM or needing to expand the swap file a lot.
When you update a key, the space is already allocated so you don't see any performance issues.
Since I was doing lots of set(key, value) calls from Node.js, which are issued asynchronously, a lot of socket connections were open concurrently. The Node.js socket write buffer might have been overloaded, and GC might have come in and fiddled with the node process.
PS: I changed the Redis memory configuration as Tom suggested, but it still performed the same.