I am looking at the documentation on Lamba Limits which says:
Number of file descriptors 1,024
I am wondering if this is per invoking lambda or total across all lambdas?
I am processing a very large number of items from a kinesis stream and I am calling a web endpoint and it I seem to be hitting a bottle neck of about 1024 concurrent connections to the API and I'm not sure where the bottleneck is. I'm investigating limits on my load balancer and instances but I'm also wondering if lambda itself simply cannot create more than 1024 concurrent outbound connections across all lambdas?
This question is old, but a suitable answer may help others in the future. The limit as correctly noted in the question is 1,024 outbound connections per Lambda function. However this limit is only for the life cycle of the container. There are currently no public documents stating the length of the life cycle, however through my own testing it resulted in the following:
A new container is created after 5 minutes of idle time for the Lambda function
A new container is created after 60 minutes of frequent use of the Lambda function
A new container is created on any update to the code or configuration of the Lambda
A final note on the new containers, when a new container is created it will run all of your code from the start whereas invoking a warm container will just invoke the handler, skipping the loading of the libraries etc. As this is the case it is a best practice to implement connection pooling and declare the connection outside of the handler so that it can be reused in subsequent invokes, examples of this can be found in the AWS docs
Related
Given following architecture:
The issue with that is that we reach throttling due to the maximum number of concurrent lambda executions (1K per account).
How can this be address or circumvented?
We want to have full control of the rate-limiting.
1) Request concurrency increase.
This would probably be the easiest solution but it would increase the potential workload quite much. It doesn't resolve the root cause nor does it give us any flexibility or room for any custom rate-limiting.
2) Rate Limiting API
This would only address one component, as the API is not the only trigger of the step-functions. Besides, it will have impact to the clients, as they will receive a 4x response.
3) Adding SQS in front of SFN
This will be one of our choices nevertheless, as it is always good to have a queue on top of such number of events. However, a simple queue on top does not provide rate-limiting.
As SQS can't be configured to execute SFN directly a lambda in between would be required, which then triggers then SFN by code. Without any more logic this would not solve the concurrency issues.
4) FIFO-SQS in front of SFN
Something along the line what this blog-post is explaining.
Summary: By using a virtually grouped items we can define the number of items that are being processed. As this solution works quite good for their use-case, I am actually not convinced it would be a good approach for our use-case. Because the SQS-consumer is not the indicator of the workload, as it only triggers the step-functions.
Due to uneven workload this is not optimal as it would be better to have the concurrency distributed by actual workload rather than by chance.
5) Kinesis Data Stream
By using Kinesis data stream with predefined shards and batch-sizes we can implement the logic of rate-limiting. However, this leaves us with the exact same issues described in (3).
6) Provisioned Concurrency
Assuming we have an SQS in front of the SFN, the SQS-consumer can be configured with a fixed provision concurrency. The value could be calculated by the account's maximum allowed concurrency in conjunction with the number of parallel tasks of the step-functions. It looks like we can find a proper value here.
But once the quota is reached, SQS will still retry to send messages. And once max is reached the message will end up in DLQ. This blog-post explains it quite good.
7) EventSourceMapping toogle by CloudWatch Metrics (sort of circuit breaker)
Assuming we have a SQS in front of SFN and a consumer-lambda.
We could create CW-metrics and trigger the execution of a lambda once a metric is hit. The event-lambda could then temporarily disable the event-source-mapping between the SQS and the consumer-lambda. Once the workload of the system eases another event could be send to enable the source-mapping again.
Something like:
However, I wasn't able to determine proper metrics to react on before the throttling kicks in. Additionally, CW-metrics are dealing with 1-minute frames. So the event might happen too late already.
8) ???
Question itself is a nice overview of all the major options. Well done.
You could implement throttling directly with API Gateway. This is the easiest option if you can afford rejecting the client every once in a while.
If you need stream and buffer control, go for Kinesis. You can even put all your events in S3 bucket and trigger lambdas or Step Function when a new event has been stored (more here). Yes, you will ingest events differently and you will need a bridge lambda function to trigger Step Function based on Kinesis events. But this is relatively low implementation effort.
We have basically
dynamodb streams =>
trigger lambda (batch size XX, concurrency 1, retries YY) =>
write to service
There are multiple shards, so we may have some number of concurrent writes to the service. Under some conditions too many streams have too much data, and too many lambda instances are writing to the service, which then responds with 429.
Right now the failure simply ends up being a failure, the lambda retries, but the service is still overwhelmed.
What we would like to do is just have the lambda triggers delay before triggering a lambda retry, essentially have an exponential backoff before triggering. We can easily implement that "inside" the lambda, we can retry and wait for up to the 15m lambda duration.
But then we are billed for whole lambda execution time, while it is sleeping for however many backoffs are required.
Is there a way to configure the lambda/dynamodb trigger to have a delay (that we can control up and down) before invoking the retry? For SQS triggers there is some talk of redrive policy that somehow can control the rate of retries - but not clear how or if that applies to dynamodb streams.
I understand that the streams will "backup" as we slow down the dispatch of lambdas, but this is assumed to be a transient situation, and the dynamodb stream will act as a queue. And we can also configure a dead letter queue, but that is sort of orthogonal to the basic question.
You can configure a wait. And yes, while you are billed by the time use, its pennies. Seriously, the free aws account covers a million lambda invocations a month. At the enterprise level its really nothing compared to what EC2 servers cost. But Im not your CFO so maybe it is a concern.
You can take your stream and process it into whatever service calls you would need and have their paylods all added to the same SQS. You can configure your SQS to throttle it self in effect, so it only sends so many over a given time. The messages in your queue wold go to another lambda that would do the service call for you, one at a time. It would be doled out by the SQS
set up a Dead Letter Queue instead (possibly in combination with either of the above) to catch the failed ones and try again when traffic is lower.
As an aside, you dont want to 'pause' your dynamo stream as it only has a 24 hour TTL on it. If your stream pauses for too long you will loose data. Better to take the stream in whole and put it into an SQS queue as individual writes because SQS has a TTL of up to 14 days.
We are building an asp.net core 3 application which uses ef core 3.0 with Pomelo.EntityFrameworkCore.MySql provider 3.0.
Right now we are trying to replace all database calls from sync to async, like:
//from
dbContext.SaveChanges();
//to
await dbContext.SaveChangesAsync();
Unfortunetly when we do it we expereince two issues:
Number of connections to the server grows significatntly compared to the same tests for sync calls
Average processing speed of our application drops significantly
What is the recommended way to use ef core with mysql asynchronously? Any working example or evidence of using ef-core 3 with MySql asynchonously would be appreciated.
It's hard to say what the issue here is without seeing more code. Can you provide us with a small sample app that reproduces the issue?
Any DbContext instance uses exactly one database connection for normal operations, independent of whether you call sync or async methods.
Number of connections to the server grows significatntly compared to the same tests for sync calls
What kind of tests are we talking about? Are they automated? If so, how many tests are being run? Because of the nature of async calls, if you run 1000 tests in parallel, every test with its own DbContext, you will end up with 1000 parallel connections.
Though with Pomelo, you will not end up additionally with 1000 threads, as you would with using Oracle's provider.
Update:
We test asp.net core call (mvc) which goes to db and read and writes something. 50 threads, using DbContextPool with limit 500. If i use dbContext.SaveChanges(), Add(), all context methods sync, I am land up with around 50 connections to MySql, using dbContext.SaveChangesAsnyc() also AddAsnyc, ReadAsync etc, I end up seeing max 250 connection to MySql and the average response time of the page drops by factor of 2 to 3.
(I am talking about ASP.NET and requests below. The same is true for parallel run test cases.)
If you use Async methods all the way, nothing will block, so your 50 threads are free to handle the next 50 requests while the database is still executing the queries for the first 50 requests.
This will happen again and again because ASP.NET might process your requests faster than your database can return its results. So you will end up with a lot of parallel database queries.
This does not happen when executing the Sync methods, because every thread blocks and you end up with a maximum of 50 parallel queries (one per thread).
So this is expected behavior and just a consequence of async method calls.
You can always modify your code or web server configuration to limit the amount of concurrent ASP.NET requests.
50 threads, using DbContextPool with limit 500.
Also be aware that DbContextPool does not limit how many DbContext objects can concurrently exist, but only how many will be kept in the pool. So if you set DbContextPool to 500, you can create more than 500 contexts, but only 500 will be kept alive after using them.
Update:
There is a very interesting low level talk about lock-free database pool programming from #roji that addresses this behavior and takes your position, that there should be an upper limit in the connection pool that should result in blocking when exceeded and makes a great case for this behavior.
According to #bgrainger from MySqlConnector, that is how it is already implemented (the docs did not explicitly state this, but they do now). The MaxPoolSize connection string option has a default value of 100, so if you use connection pooling and if you don't overwrite this value and if you don't use multiple connection pools, you should not have more than 100 connections active at a given time.
From GitHub:
This is a documentation error, if you are interpreting the docs to mean that you can create an unlimited number of connections.
When pooling is true, each connection pool (there is one per unique connection string) only allows MaximumPoolSize connections to be open simultaneously. Each additional call to MySqlConnection.Open will block until a connection is returned to the pool.
When pooling is false, there is no limit to the number of connections that can be opened simultaneously; it's up to the user to manage the concurrency.
Check to see whether you have Pooling=false in your connection string, as mentioned by Bradley Grainger in comments.
After I removed pooling=false from my connection string, my app ran literally 3x faster.
I have a specific Lambda function invoked by SNS events that repeatedly times out in about 1/2 of its instances that seem to be running any of the handler code.
What's peculiar is that I have a number of log statements at the very start of the function handler that should be getting triggered.
I've tried increasing the timeout to 120 seconds, but this doesn't fix anything. I've also looked at the Lambda init logic (the code outside the main handler method) but its just simple imports and class initialisation, no database connections or HTTP requests that might be causing a timeout.
The handler logic does include database connections and network requests, but those were timing out then I'd expect to also see some logs prior to the timeouts.
When I view the Lambda logs by stream then around half of them look like the above and just time out, whereas the other half run as expected. Are streams specific to individual Lambda containers? If so then it looks as if there is a number of "dead" containers.
Has anyone experienced an issue like this in the past or has any idea what is going on?
This issue was fixed after realising that the lambda was inside two different subnets, one of which didn't have a NAT gateway. After moving the lambda to a single subnet with a NAT the timeouts have stopped.
We've got several web and worker roles in Azure connecting to our Azure Redis cache via the StackExchange.Redis library, and we're receiving regular timeouts that are making our end-to-end solution grind to a halt. An example of one of them is below:
System.TimeoutException: Timeout performing GET stream:459, inst: 4,
mgr: Inactive, queue: 12, qu=0, qs=12, qc=0, wr=0/0, in=65536/0 at
StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message
message, ResultProcessor1 processor, ServerEndPoint server) in
c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line
1785 at StackExchange.Redis.RedisBase.ExecuteSync[T](Message
message, ResultProcessor1 processor, ServerEndPoint server) in
c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line
79 at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key,
CommandFlags flags) in
c:\TeamCity\buildAgent\work\58bc9a6df18a3782\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line
1346 at
OptiRTC.Cache.RedisCacheActions.<>c__DisplayClass41.<Get>b__3() in
c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 104 at
Polly.Retry.RetryPolicy.Implementation(Action action, IEnumerable1
shouldRetryPredicates, Func`1 policyStateFactory) at
OptiRTC.Cache.RedisCacheActions.Get[T](String key, Boolean
allowDirtyRead) in
c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheActions.cs:line 107 at
OptiRTC.Cache.RedisCacheAccess.d__e4.MoveNext()
in c:\dev\OptiRTCAzure\OptiRTC.Cache\RedisCacheAccess.cs:line 1196;
TraceSource 'WaWorkerHost.exe' event
All the timeouts have different queue and qs numbers, but the rest of the messages are consistent. These StringGet calls are across different keys in the cache. In each of our services, we use a singleton cache access class with a single ConnectionMultiplexer that is registered with our IoC container in the web or worker role startup:
container.RegisterInstance<ICacheAccess>(cacheAccess);
In our implementation of ICacheAccess, we're creating the multiplexer as follows:
ConfigurationOptions options = new ConfigurationOptions();
options.EndPoints.Add(serverAddress);
options.Ssl = true;
options.Password = accessKey;
options.ConnectTimeout = 1000;
options.SyncTimeout = 2500;
redis = ConnectionMultiplexer.Connect(options);
where the redis object is used throughout the instance. We've got about 20 web and worker role instances connecting to the cache via this ICacheAccess implementation, but the management console shows an average of 200 concurrent connections to the cache.
I've seen other posts that reference using version 1.0.333 of StackExchange.Redis, which we're doing via NuGet, but when I look at the actual version of the StackExchange.Redis.dll reference added, it shows 1.0.316.0. We've tried adding and removing the NuGet reference as well as adding it to a new project, and we always get the version discrepancy.
Any insight would be appreciated. Thanks.
Additional information:
We've upgraded to 1.0.371. We have two services that each access the same cache object at different intervals, one to edit and occasionally read and one that reads this object several times a second. Both services are deployed with the same caching code and StackExchange.Redis library version. I almost never see time outs in the service that edits the object but I get timeouts between 50 and 75% of the time on the services that reads it. The timeouts have the same format as the one indicated above, and they continue to occur after wrapping the db.StringGet call in a Polly retry block that handles both RedisException and System.TimeoutException and retries once after 500ms.
We contacted Microsoft about this issue, and they confirm that they see nothing in the Redis logs that indicate an issue on the Redis service side. Our cache miss % is extremely low on the Redis server, but we continue to get these timeouts, which substantially hinder our application's functionality.
In response to the comments, yes, we always have a number in qs and never in qc. We always have a number in the first part of the in and never in the second.
Even more additional information:
When I run a service with fewer instances at a higher CPU, I get significantly more of these timeout errors than when instances are running at lower CPUs. More specifically, I pulled some numbers from our services this morning. When they were running at around 30% CPU, I saw very few timeout issues - just 42 over 30 minutes. When I removed half the instances and they started to run at around 60-65% CPU, the rate increased 10-fold to 536 over 30 minutes.
I know this thread is months old but I think my own experiences can add some value here. I had the same problem with Azure Redis Cache (timeouts on Gets) but realized that it was almost exclusively happening on Gets where the string value was relatively large (> 250K in length). I implemented gzip on both Gets and Sets (when the string value is large) and now I almost never get a timeout.
Even if this doesn't solve your particular problem, it's probably good practice to compress the values in general to reduce costs and improve performance.
Regarding the version numbers, it seems that the AssemblyVersion has been locked at 1.0.316 for the last several releases, but the AssemblyFileVersion has been updated to match the NuGet package version. For now, I recommend ignoring AssemblyVersion and just using AssemblyFileVersion to ensure you have the correct binary.
Please contact us at AzureCache#microsoft.com if you are still seeing timeouts using Azure Redis Cache.