What does overProvisioned Memory mean for Lambda in AWS CloudWatch? - aws-lambda

I am trying to learn more about monitoring and analysis of lambda functions in my serverless environment, to understand how to point out 'suspect' lambdas that need attention. I have been running through some sample queries in the Logs Insights section, and I have a few lambdas that have this result.
I'm basically trying to understand if this is something that needs fixing quickly, or if it's not a big deal if there is so much overProvisioned memory?
Should I be more worried looking at Duration/Concurrency issues than this metric?

TLDR: overprovisioned memory and duration affect billing cost. Both parameters can be controlled where possible to cost-effective values.
Allocated memory, together with duration and number of times the lambda is executed per month is used for computing billing cost for the month. [1]
Currently, the lambda uses roughly 14% of provisioned memory at maximum load; the remaining fraction can be utilised.
If you're serving a huge number of requests, reducing over-provisioned memory and duration can be cost effective.
My recommendation is to provision memory to be the sum of max load plus (50% - 75%) of max load, and to review the duration.
Concurrency doesn't factor in monthly billing cost.
Some numbers: [2]
Default concurrency limit for functions = 100
Hard set concurrency limit for account = 1000
Reducing the duration means you can serve more requests at a time.
The concurrency limit per account can be increased on request to AWS Support.
Another typical workaround for concurrency issues is to throttle requests using a queue. This may be more costly.
The lambda receiving the request creates a new SNS topic, envelopes it together with the request, pushes it to a message queue and returns the topic to the caller.
Caller receives and subscribes to topic.
Another lambda processes the queue and reports the status of the job to the topic.
Caller receives message.
Account limit for number of topics is set at 100,000 [3].
This limit can be increased by a request to AWS Support, although cleaning up topics that are no longer necessary to keep around can be more suitable.
Having to design through these workarounds for concurrency limits could mean that the application requirements are more suited to a traditional web application backed by a long running server.
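The sizing rule from the answer above, worked through as an added example (not part of the original answer): it reads the REPORT lines Lambda writes to CloudWatch Logs and derives the over-provisioned memory, the utilisation, and a memory size of peak usage plus 50% headroom. The log lines are constructed sample data, and `summarize` and `headroom` are names chosen for this sketch.

```python
import math
import re

# Layout of the REPORT line Lambda writes to CloudWatch Logs after every invocation.
REPORT_RE = re.compile(
    r"REPORT RequestId: (?P<rid>\S+)\s+Duration: [\d.]+ ms\s+"
    r"Billed Duration: (?P<billed>[\d.]+) ms\s+"
    r"Memory Size: (?P<size>\d+) MB\s+Max Memory Used: (?P<used>\d+) MB"
)

# Constructed sample log stream (request ids and values are made up).
SAMPLE_LOG = (
    "START RequestId: 1111-aaaa Version: $LATEST\n"
    "END RequestId: 1111-aaaa\n"
    "REPORT RequestId: 1111-aaaa\tDuration: 212.40 ms\tBilled Duration: 213 ms\t"
    "Memory Size: 1024 MB\tMax Memory Used: 131 MB\tInit Duration: 310.12 ms\n"
    "REPORT RequestId: 2222-bbbb\tDuration: 95.10 ms\tBilled Duration: 96 ms\t"
    "Memory Size: 1024 MB\tMax Memory Used: 143 MB\n"
    "REPORT RequestId: 3333-cccc\tDuration: 101.77 ms\tBilled Duration: 102 ms\t"
    "Memory Size: 1024 MB\tMax Memory Used: 138 MB\n"
)

LAMBDA_MIN_MB = 128  # smallest memory size Lambda accepts


def parse_reports(log_text):
    """Return one dict per REPORT line; START/END and other lines are skipped."""
    reports = []
    for line in log_text.splitlines():
        m = REPORT_RE.search(line)
        if m:
            reports.append({"billed_ms": float(m["billed"]),
                            "size_mb": int(m["size"]), "used_mb": int(m["used"])})
    return reports


def summarize(log_text, headroom=0.5):
    """headroom=0.5 is the lower end of the 50%-75% range recommended above."""
    reports = parse_reports(log_text)
    if not reports:
        return None
    provisioned = max(r["size_mb"] for r in reports)  # largest size seen in the window
    peak = max(r["used_mb"] for r in reports)
    suggested = max(LAMBDA_MIN_MB, math.ceil(peak * (1 + headroom)))
    billed_s = sum(r["billed_ms"] for r in reports) / 1000
    return {
        "provisioned_mb": provisioned,
        "peak_used_mb": peak,
        "over_provisioned_mb": provisioned - peak,
        "utilisation": peak / provisioned,
        "suggested_mb": suggested,
        # GB-seconds (memory x billed duration) is the quantity the bill scales with.
        "gb_seconds": provisioned / 1024 * billed_s,
        "gb_seconds_if_resized": suggested / 1024 * billed_s,  # assumes same durations
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

Lambda allocates CPU in proportion to configured memory, so durations can rise after a resize; the resized GB-seconds figure assumes they stay the same and should be re-measured after the change.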

Related

Kubernetes throttling JVM application that isn't hitting CPU quota

I am running a Kotlin Spring Boot based service in a Kubernetes cluster that connects to a PostgreSQL database. Each request takes around 3-5 database calls which partially run in parallel via Kotlin coroutines (with a threadpool backed coroutine context present).
No matter the configuration this service gets throttled heavily after getting hit by real traffic after just starting up. This slowness sometimes persists for 2-3 minutes and often only affects some fresh pods, but not all.
I am looking for new avenues to analyze the problem - here's a succinct list of circumstances / stuff I am already doing:
The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
New / autoscaled instances warm themselves up by doing 15000 HTTP requests against themselves. The readiness probe is not "up" before this process finishes
We are currently setting a cpu request and limit of 2000m, changing this to 3000m does reduce the issue but the latency still spikes to around 300-400ms which is not acceptable (at most 100ms would be great, 50ms ideal)
The memory is set to 2gb, changing this to 3gb has no significant impact
The pods are allocating 200-300mb/s during peak load, the GC activity does not seem abnormal to me
Switching between GCs (G1 and ZGC) has no impact
We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
New pods struggle to take 200-300 requests / sec even though we warm up, curiously enough some pods suffer for long periods. All external factors have been analyzed and disabling most baggage has no impact (this includes testing with disabled tracing, metric collection, disabling Kafka integration and verifying our database load is not maxing out - it's sitting at around 20-30% CPU usage while network and memory usage are way lower)
The throttling is observed in custom load tests which replicates the warmup requests described above
Connecting with visualvm during the load tests and checking the CPU time spent yields no striking issues
This is all done on a managed kubernetes by AWS
All the nodes in our cluster are of the same type (c5.2xlarge of AWS)
Any tools / avenues to investigate are appreciated - thank you! I am still puzzled why my service is getting throttled although its CPU usage is way below 100%. Our nodes are also not affected by the old kernel cfs bug from before kernel 5.6 (not entirely sure in which version it got fixed, we are very recent on our nodes' kernel version though).
In the end this all boiled down to missing one part of the equation: I/O bounds.
Imagine if one request takes 10 DB calls, each taking 3 milliseconds to fulfill (including network latency etc.). A single request then takes 10*3 = 30 milliseconds of I/O. The request throughput of one request is then 1000ms / 30ms = 33,33 requests / second. Now if one service instance uses 10 threads to handle requests we get 333,3 requests / seconds as our upper bound of throughput. We can't get any faster than this because we are I/O bottlenecked in regards to our thread count.
And this leaves out multiple factors like:
thread pool size vs. db connection pool size
our service doing non-db related tasks (actual logic, json serialization when the response get fulfilled)
database capacity (was not an issue for us)
TL;DR: You can't get faster when you are I/O bottlenecked, no matter how much CPU you provide. I/O has to be improved if you want your single service instance to have more throughput; this is mostly done by db connection pool sizing in relation to thread pool sizing in relation to db calls per request. We missed this basic (and well known) relation between resources!

EC2 host type for a DynamoDB batchWrite call

I have a requirement to bulk upload an Excel sheet to a DynamoDB table and the maximum number of rows is 200,000. The website for bulk upload will be used less frequently, so we can assume there are only 1 - 2 bulk uploads being processed at a given time. In the backend, I am using Apache POI API to parse the Excel sheet into DynamoDB Items.
Because we can only send up to 25 items in a batchWriteItem call, the current latency is around 15 minutes (900 seconds) to completely upload all the 200,000 items. Hence I am planning to implement multi threading to execute multiple batchWriteItem API calls in parallel. Can you help me understand which EC2 host types are best suited for multi-threading for this purpose?
Any references will be really helpful.
Normally, multi-threading would be helped by using an Instance Type that has multiple CPUs.
However, you are describing behaviour that is waiting on network rather than CPU. Therefore, it is likely that the operation you describe is not being heavily impacted by CPU Utilization.
The best way to answer your question is to recommend that you experiment with different instance types to find the one that is best for your application's combination of needs:
Pick an instance family (eg m5) and try a few different sizes
Compare this against another family (eg c5) to see whether the improved performance is worth the extra cost
Monitor the application to find the bottleneck, which would either be RAM, CPU, Network or Disk access
Please note that smaller instances have less Network bandwidth, so you might need to choose a larger instance type to avoid being throttled on network bandwidth. This might result in excess CPU that isn't being fully utilized.

Elasticsearch drops too many requests -- would a buffer improve things?

We have a cluster of workers that send indexing requests to a 4-node Elasticsearch cluster. The documents are indexed as they are generated, and since the workers have a high degree of concurrency, Elasticsearch is having trouble handling all the requests. To give some numbers, the workers process up to 3,200 tasks at the same time, and each task usually generates about 13 indexing requests. This generates an instantaneous rate that is between 60 and 250 indexing requests per second.
From the start, Elasticsearch had problems and requests were timing out or returning 429. To get around this, we increased the timeout on our workers to 200 seconds and increased the write thread pool queue size on our nodes to 700.
That's not a satisfactory long-term solution though, and I was looking for alternatives. I have noticed that when I copied an index within the same cluster with elasticdump, the write thread pool was almost empty and I attributed that to the fact that elasticdump batches indexing requests and (probably) uses the bulk API to communicate with Elasticsearch.
That gave me the idea that I could write a buffer that receives requests from the workers, batches them in groups of 200-300 requests and then sends the bulk request to Elasticsearch for one group only.
Does such a thing already exist, and does it sound like a good idea?
First of all, it's important to understand what happens behind the scenes when you send the index request to Elasticsearch, to troubleshoot the issue or find the root cause.
Elasticsearch has several thread pools, but for indexing requests (single/bulk) the write threadpool is used. Please check this according to your Elasticsearch version, as Elastic keeps on changing the threadpools (earlier there was a separate threadpool for single and bulk requests with different queue capacities).
In the latest ES version (7.10), the write threadpool's queue capacity increased significantly to 10000 from 200 (which existed in earlier releases); there may be the below reasons for this.
Elasticsearch now prefers to buffer more indexing requests instead of rejecting the requests.
Although increasing queue capacity means more latency, it's a trade-off, and this will reduce data loss if the client doesn't have a retry mechanism.
I am sure you would not have moved to ES version 7.9, when the capacity was increased, but you can increase the size of this queue slowly and allocate more processors (if you have more capacity) easily through the config change mentioned in this official example. Although this is a very debatable topic and a lot of people consider this a band-aid solution rather than the proper fix, now that Elastic themselves increased the queue size, you can also try it, and if you have a short duration of increased traffic then it makes even more sense.
Another critical thing is to find out the root cause of why your ES nodes are queuing up more requests. It can be legitimate, like increasing indexing traffic and infra that has reached its limit, but if it's not legitimate you can have a look at my short tips to improve one-time indexing performance and overall indexing performance; by implementing these tips you will get a better indexing rate, which will reduce the pressure on the write thread pool queue.
Edit: As mentioned by #Val in the comment, if you are also indexing docs one by one then moving to bulk index API will give you the biggest boost.
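The buffer proposed in the question, sketched as an added example (not code from the question or the answer): it collects single index requests, encodes each batch in the newline-delimited format the `_bulk` API expects, and re-queues items the cluster rejected with 429. `BulkBuffer`, `send` and `fake_cluster` are hypothetical names; `fake_cluster` is a constructed stand-in so the block runs without a cluster.

```python
import json


class BulkBuffer:
    """Collects single index requests and sends them as one _bulk body per batch."""

    def __init__(self, send, max_docs=250):
        self.send = send          # hypothetical transport: NDJSON body -> parsed _bulk response
        self.max_docs = max_docs  # 200-300 per batch, as proposed in the question
        self.pending = []         # (index name, document) pairs waiting to be sent
        self.failed = []          # items rejected for a reason other than 429

    def add(self, index, doc):
        self.pending.append((index, doc))
        return self.flush() if len(self.pending) >= self.max_docs else 0

    @staticmethod
    def encode(batch):
        # _bulk body: an action line, then the document, each terminated by a newline.
        lines = []
        for index, doc in batch:
            lines.append(json.dumps({"index": {"_index": index}}))
            lines.append(json.dumps(doc))
        return "\n".join(lines) + "\n"

    def flush(self):
        if not self.pending:
            return 0
        batch, self.pending = self.pending, []
        resp = self.send(self.encode(batch))
        accepted = len(batch)
        if resp.get("errors"):
            # The bulk call itself can succeed while single items are rejected;
            # items[] comes back in request order, so zip pairs them up.
            for pair, item in zip(batch, resp["items"]):
                status = item["index"]["status"]
                if status == 429:              # write queue full: retry with the next batch
                    self.pending.append(pair)
                    accepted -= 1
                elif status >= 300:            # e.g. mapping error: retrying will not help
                    self.failed.append((pair, status))
                    accepted -= 1
        return accepted


def fake_cluster():
    """Constructed stand-in for a cluster: rejects the first item of the first call."""
    calls = []
    def send(body):
        n = len(body.splitlines()) // 2
        items = [{"index": {"status": 201}} for _ in range(n)]
        if not calls:
            items[0]["index"]["status"] = 429
        calls.append(n)
        return {"errors": any(i["index"]["status"] >= 300 for i in items), "items": items}
    return send, calls


def demo(total=600):
    send, calls = fake_cluster()
    buf = BulkBuffer(send, max_docs=250)
    accepted = sum(buf.add("tasks", {"task_id": i}) for i in range(total))
    while buf.pending:             # drain what is left, including re-queued items
        accepted += buf.flush()
    return calls, accepted, buf.failed
```

A buffer used in production also needs a time-based flush so that a partly filled batch does not wait indefinitely, and a bound on how often a 429 item is retried.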

Maximum number of concurrent requests a webserver can serve assuming average service time to be known

Is it logical to say: "If average service time for a request is X and affordable waiting time for the requests is Y then maximum number of concurrent requests to serve would be Y / X" ?
I think what I'm asking is that if there're any hidden factors that I'm not taking into account!?
If you're talking specifically about webservers, then no, your formula doesn't work, because webservers are designed to handle multiple, simultaneous requests, using forking or threading.
This turns the formula into something far harder to quantify - in my experience, web servers can handle LOTS (i.e. hundreds or thousands) of concurrent requests which consume little or no time, but tend to reduce that concurrency quite dramatically as the requests consume more time.
That means that "average service time" isn't massively useful - it can hide wide variations, and it's actually the outliers that affect you the most.
Broadly yes, but your service provider (webserver in your case) is capable of handling more than one request in parallel, so you should take that into account. I assume you measured end to end service time and haven't already averaged it by number of parallel streams. One other thing you didn't and cannot realistically measure is the delay to/from your website.
What you are heading towards is the Erlang unit (not the language using the same name), which is used to describe how much load a system can take. Erlangs are unitless (it is just a number) and originated from old school telephony, POTS, where it was used to describe how many wires were needed to handle X calls per time period with low blocking probability. Beyond Erlang is Engset, which is used more for high capacity systems, such as mobile systems.
It also gets used for expensive consultant reports into realtime computer systems and databases to describe the point at which performance degradation is likely to occur. Wikipedia has an article on this http://en.wikipedia.org/wiki/Erlang_(unit) and the book 'Fixed and mobile telecommunications, network systems and services' has a good chapter on performance analysis.
While aimed at telephone systems, just replace it with the word webserver and it behaves the same. A webserver is the same concept: load is offered that arrives at random intervals to a system with finite parallel capacity. In your case, you can probably calculate total load with load tools more easily than parallel capacity and then back calculate the formulas. This is widely done to gain a level of confidence in overall system models.
Erlang/Engset formulas are really useful when you have a randomly arriving load over parallel streams (ie web requests) and a service time that can only be averaged or estimated (ie it varies in real life). You can then calculate the blocking probability, which is the probability a new request will need to wait while current requests are serviced, and how long it will wait. It also helps analyse whether you need to handle more requests in parallel, or make each faster (#lines and holding time in Erlang speak).
You will probably look into queuing systems analysis next; as soon as requests block (queue), the models change slightly.
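The Erlang calculation described above, as an added example (not part of the original answers): it computes the probability that a request has to wait and the mean wait, then finds how many parallel workers keep the mean wait within the affordable time Y for a service time X. It assumes randomly arriving requests and exponentially distributed service times (the M/M/c model); the function names are chosen for this sketch.

```python
import math


def erlang_b(offered, servers):
    """Blocking probability when there is no queue (Erlang B), via the stable recurrence."""
    b = 1.0
    for n in range(1, servers + 1):
        b = offered * b / (n + offered * b)
    return b


def erlang_c(offered, servers):
    """Probability that an arriving request has to wait for a free worker (Erlang C)."""
    if offered >= servers:
        return 1.0  # more load than capacity: the queue grows without bound
    b = erlang_b(offered, servers)
    rho = offered / servers
    return b / (1 - rho * (1 - b))


def mean_wait(arrival_rate, service_time, servers):
    """Average time a request spends queued, in the unit of service_time."""
    offered = arrival_rate * service_time  # offered load in erlangs
    if offered >= servers:
        return math.inf
    return erlang_c(offered, servers) * service_time / (servers - offered)


def servers_needed(arrival_rate, service_time, max_wait):
    """Smallest number of parallel workers whose mean wait is at most max_wait."""
    if max_wait <= 0:
        raise ValueError("max_wait must be positive")
    n = 1
    while mean_wait(arrival_rate, service_time, n) > max_wait:
        n += 1
    return n


if __name__ == "__main__":
    # 10 requests/s, X = 0.1 s average service time, Y = 0.05 s affordable wait.
    print(servers_needed(10, 0.1, 0.05), mean_wait(10, 0.1, 2))
```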
many factors are not taken into account
memory limits
data locking constraints such as people wanting to update the same data
application latency
caching mechanisms
different users will have different tasks on the site and put different loads
That said, one easy way to get a rough estimate is with apache ab tool (apache benchmark)
Example, get 1000 times the homepage with 100 requests at a time:
ab -c 100 -n 1000 http://www.example.com/

AWS AutoScaling not working / CPU Utilization stays sub 30%

I have set up AWS AutoScaling as following:
1) created a Load Balancer and registered one instance with it;
2) added Health Checks to the ELB;
3) added 2 Alarms:
- CPU Usage -> 60% for 60s, spin up 1 instance;
- CPU usage < 40% for 120s, spin down 1 instance;
4) wrote a jMeter script to send traffic to the website in question: 250 threads, 200 seconds ramp up time, loop count 5.
What I am seeing is very strange.
I expect the CPU usage to shoot up with the higher number of users. But instead the CPU usage stays between 20-30% (which is why the new instance never fires up) and running instance starts throwing timeout errors once it reaches anything more than 100 users.
I am at a loss to understand why CPU usage is so low when the website is in fact timing out.
Ideas?
This could be a problem with the ELB. The ELB does not scale very quickly, it takes a consistent amount of traffic to the ELB to let amazon know you need a bigger one. If you just hit it really hard all at once that does not help it scale. So the ELB could be having problems handling all the connections.
Is this SSL? Are you doing SSL on the ELB? That would add overhead to an underscaled ELB as well.
I would honestly recommend not using ELB at all. haproxy is a much better product and much faster in most cases. I can elaborate if needed, but just look at how Amazon handles the cname vs what you can do with haproxy...
It sounds like you are testing AutoScaling to ensure it will work for your needs. As a first pass to simply see if AS will launch a new instance, try reducing your CPU up check to trigger at 25%. I realize this is a lot lower than you are hoping to use moving forward, but it will help validate that your initial configuration is working.
As a second step, you should take a look at your application and see if CPU is the best metric to have AS monitor for scaling. It is possible that you have a bottleneck somewhere else in your app that may not necessarily be CPU related (web server tuning, memory, databases, storage, etc). You didn't mention what type of content you're serving out; is it static or generated by an interpreter (like PHP or something else)? You could also send your own custom metric data into CloudWatch and use this metric to trigger the scaling.
You may also want to time how long it takes for an instance to be ready to serve traffic from a cold start. If it takes longer than 60 seconds, you may want to adjust your monitoring threshold time appropriately (or set cool down periods). As chantheman pointed out, it can take some time for the ELB to register the instance as well (and a longer amount of time if the new instance is in a different AZ).
I hope all of this helps.
What we discovered is that when you are using autoscale on t2 instances, and under heavy load, those instances will run out of CPU credits and then they are limited to 20% of CPU (from the monitoring point of view, internal htop is still 100%). Internally they are at maximum load.
This sends a false metric to Autoscaling, and new instances will not fire.
You need to change the metric, develop your own, or move to m instances.

Resources