Placing custom execution rate limits on AWS Step Functions - throttling

I have a Step Functions setup that spawns a Lambda function. The Lambda functions are getting spawned too fast and causing other services to throttle, so I would like Step Functions to apply a rate limit on the number of jobs it kicks off at a given time.
How do I best approach this?

I would try setting this limit on the Lambda function side using a concurrent execution (reserved concurrency) limit - https://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html. This way you can limit the maximum number of concurrent executions of that specific Lambda function, leaving the unreserved concurrency pool for the rest of your functions.
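For example, a minimal sketch of setting reserved concurrency with boto3 (the function name and limit here are hypothetical):

import boto3

lambda_client = boto3.client("lambda")

# Cap this function at 10 concurrent executions; invocations beyond that are
# throttled by Lambda instead of hammering the downstream services.
lambda_client.put_function_concurrency(
    FunctionName="my-step-function-worker",   # hypothetical function name
    ReservedConcurrentExecutions=10,          # hypothetical limit
)

Throttled invocations then surface to the state machine as Lambda.TooManyRequestsException, which you would typically handle with a Retry on the Task state.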

Related

How to scale concurrent step function executions and avoid any maxConcurrent exceptions?

Problem: I have a Lambda which produces an array of objects that can contain a few thousand elements (worst case). Each object in this array should be processed by a Step Function.
I am trying to figure out what the most scalable and fault-tolerant solution is, so that every object is processed by the Step Function.
The complete Step Function does not have a long execution time (under 5 min) but has to wait in some steps for other services before continuing the execution (WaitForTaskToken). The Step Function contains a few short-running Lambdas.
These are the possibilities I have at the moment:
1. Naive approach: In my head, a few thousand or even ten thousand concurrent executions are not a big deal, so why can't I just iterate over each element and start an execution directly from the Lambda?
2. SQS: The Lambda can put each object into SQS, and another Lambda processes a batch of 10 and starts 10 Step Function executions. Then I could set a max concurrency on the processing Lambda to avoid too many Step Function executions. But there are known issues with such an approach where messages may not get processed, and overall this is a lot of overhead, I think.
3. Using a Map state: I could just give the array to a Map state which runs the state machine for each object with at most 40 concurrent iterations. But what if the array is larger than 40? Can I just catch the error in an error-catch state and retry the objects which were not processed, until all executions are either done or failed? This means if there is one failed execution I still want the other 39 executions to run.
4. Split the objects into batches and run them in parallel: Similar to 3., but instead of giving all objects to the Map state, there is another state which splits the array into batches of 40 and forwards them to the Map state, waiting until each batch is finished before processing the next one. So there is one "main" state machine which runs for a longer time, plus 40 worker executions at the same time.
All of those approaches only take the Step Function execution concurrency into account, not the Lambda concurrency. Since the Step Functions use Lambdas, there are also a lot of concurrent Lambdas running. Could this be an issue? And if so, how can I mitigate it?
Inline Map states can handle lots of iterations, but only up to 40 concurrently. Iterations over the MaxConcurrency don't cause an error; they will simply be invoked with a delay.
If your Step Function is only running ~40 concurrent iterations, Lambda concurrency should not be a constraint either.
I just tested a Map state with 1,000 items and it worked just fine. The Quotas page does not mention an upper limit.
In Distributed mode, a Map state can handle 10,000 parallel child executions.
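To make the MaxConcurrency behaviour concrete, here is a minimal sketch of an inline Map state, expressed as a Python dict (the state names, paths, and Lambda ARN are hypothetical):

import json

map_state = {
    "ProcessItems": {
        "Type": "Map",
        "ItemsPath": "$.items",      # the array produced by the upstream Lambda
        "MaxConcurrency": 40,        # iterations above 40 are queued, not failed
        "Iterator": {
            "StartAt": "ProcessOneItem",
            "States": {
                "ProcessOneItem": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",  # hypothetical ARN
                    "End": True,
                },
            },
        },
        "End": True,
    },
}

print(json.dumps(map_state, indent=2))

With MaxConcurrency set, Step Functions queues the extra iterations itself, so no separate batching state is needed when the array is larger than 40.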

Serverless MapReduce with AWS Lambda for small payload and low latency

I am trying to implement a serverless MapReduce using AWS Lambda functions for small payloads and low latency. I want to use multiple lambda functions for the Map, and another lambda function for the Reduce. Each Map function is expected to take less than a second to complete, and to produce a payload smaller than 100KB per function. I also want the Reduce function to be started at the same time as the Map functions, so that it can start reducing map results as soon as the first ones become available.
My question is the following: what is the best way to send notifications and payloads from the Map functions to the Reduce function, knowing that a single Reduce function must receive notification and payloads from all the Map functions during its lifespan (about 5 seconds)?
The Reduce function could poll an SQS queue, but I need to run multiple MapReduce jobs in parallel, and I am concerned by the latency resulting from the creation of one dedicated SQS queue for each MapReduce job. Is there a better way to approach this problem?
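For reference, a minimal sketch of the per-job queue pattern the question describes, using boto3 (the queue naming and expected-result count are hypothetical); the create_queue call is the step whose latency is the concern:

import json
import boto3

sqs = boto3.client("sqs")

def reduce_job(job_id: str, expected_results: int):
    # One dedicated queue per MapReduce job (hypothetical naming scheme).
    queue_url = sqs.create_queue(QueueName=f"mapreduce-{job_id}")["QueueUrl"]

    received = []
    while len(received) < expected_results:
        # Long polling keeps the Reduce function responsive without busy-waiting.
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=1,
        )
        for msg in resp.get("Messages", []):
            received.append(json.loads(msg["Body"]))  # one Map result payload (<100 KB)
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

    return received  # reduce over the collected Map results here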

What does overProvisioned Memory mean for Lambda in AWS Cloudwatch?

I am trying to learn more about monitoring and analysis of Lambda functions in my serverless environment, to understand how to point out 'suspect' Lambdas that need attention. I have been running through some sample queries in the Logs Insights section, and I have a few Lambdas that show this result.
I'm basically trying to understand whether this is something that needs fixing quickly, or whether it's not a big deal that there is so much overprovisioned memory.
Should I be more worried about Duration/Concurrency issues than about this metric?
TL;DR: Overprovisioned memory and duration affect the billing cost. Both parameters can be tuned, where possible, to cost-effective values.
Allocated memory, together with duration and the number of times the Lambda is executed per month, is used to compute the billing cost for the month. [1]
Currently, the Lambda uses roughly 14% of the provisioned memory at maximum load; the remaining fraction goes unused.
If you're serving a huge number of requests, reducing over-provisioned memory and duration can be cost-effective.
My recommendation is to provision memory equal to the maximum load plus 50-75% of the maximum load as headroom, and to review the duration.
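As a rough illustration of how memory and duration drive the bill, a back-of-the-envelope calculation in Python (the per-GB-second and per-request rates are illustrative assumptions; check current AWS pricing):

# Illustrative Lambda cost model: cost scales with memory (GB) x duration (s) x invocations.
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 rate; varies by region and over time
PRICE_PER_REQUEST = 0.0000002       # assumed per-invocation charge

def monthly_cost(memory_mb: int, avg_duration_ms: float, invocations: int) -> float:
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * PRICE_PER_GB_SECOND + invocations * PRICE_PER_REQUEST

# 1024 MB provisioned but only ~14% used: dropping to 256 MB (if duration holds)
# cuts the compute portion of the bill roughly 4x.
print(monthly_cost(1024, 200, 10_000_000))  # ~35.3
print(monthly_cost(256, 200, 10_000_000))   # ~10.3

Keep in mind that Lambda allocates CPU in proportion to memory, so lowering memory can increase duration; the two need to be tuned together.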
Concurrency doesn't factor into the monthly billing cost.
Some numbers: [2]
Default concurrency limit for functions = 100
Hard set concurrency limit for account = 1000
Reducing the duration means you can serve more requests at a time.
The concurrency limit per account can be increased by requesting it from AWS Support.
Another typical workaround for concurrency issues is to throttle requests using a queue. This may be more costly.
1. The Lambda receiving the request creates a new SNS topic, envelopes it together with the request, pushes it onto a message queue, and returns the topic to the caller (sketched below).
2. The caller receives and subscribes to the topic.
3. Another Lambda processes the queue and reports the status of the job to the topic.
4. The caller receives the message.
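A minimal sketch of step 1, the Lambda receiving the request, with boto3 (the queue URL and topic naming are hypothetical, and SQS is assumed as the message queue):

import json
import uuid
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/job-queue"  # hypothetical

def handler(event, context):
    # Create a per-job SNS topic the caller can subscribe to for status updates.
    job_id = str(uuid.uuid4())
    topic_arn = sns.create_topic(Name=f"job-status-{job_id}")["TopicArn"]

    # Envelope the request together with the topic and push it onto the queue
    # for the worker Lambda to pick up.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"topic_arn": topic_arn, "request": event}),
    )

    # Return the topic so the caller can subscribe and wait for the result.
    return {"job_id": job_id, "topic_arn": topic_arn}

The worker Lambda on the other side of the queue would call sns.publish(TopicArn=..., Message=...) once the job is done.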
Account limit for number of topics is set at 100,000 [3].
This limit can be increased by requesting it from AWS Support, although cleaning up topics that are no longer needed may be more suitable.
Having to design around these workarounds for concurrency limits could mean that the application requirements are better suited to a traditional web application backed by a long-running server.

how to measure python asyncio event loop metrics?

Is there a module to measure asyncio event loop metrics? Or, for the asyncio event loop, what metrics should we monitor for performance analysis purposes?
e.g.
how many tasks in the event loop?
how many tasks in waiting states?
I'm not trying to measure the coroutine functions. aiomonitor has this functionality, but it is not exactly what I need.
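For reference, a minimal sketch of pulling raw task counts out of a running loop with asyncio.all_tasks() (raw counts only; they say nothing about how heavy each task is):

import asyncio

async def report_task_count(interval: float = 1.0):
    # Periodically print how many not-yet-finished tasks the running loop has.
    while True:
        print("tasks in loop:", len(asyncio.all_tasks()))
        await asyncio.sleep(interval)

async def main():
    # Example workload: a handful of sleepers plus the reporter itself.
    workers = [asyncio.create_task(asyncio.sleep(i)) for i in range(1, 6)]
    reporter = asyncio.create_task(report_task_count())
    await asyncio.gather(*workers)
    reporter.cancel()

asyncio.run(main())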
I hardly believe the number of pending tasks or a task summary will tell you much. Let's say you have 10,000 tasks, 8,000 of them pending: is that a lot, or not? Who knows.
The thing is, each asyncio task (or any other Python object) can consume a different amount of different machine resources.
Instead of trying to monitor asyncio-specific objects, I think it's better to monitor general metrics:
CPU usage
RAM usage
Network I/O (in case you're dealing with it)
Hard drive I/O (in case you're dealing with it)
As for asyncio, you should probably always use asyncio.Semaphore to limit the maximum number of concurrently running jobs and implement a convenient way to change the value of the semaphore(s), for example through a config file.
That will let you alter the workload on a given machine depending on its available and actually utilized resources.
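A minimal sketch of that Semaphore pattern (the limit is hard-coded here, but would come from your config file in practice):

import asyncio

async def run_job(semaphore: asyncio.Semaphore, job_id: int):
    async with semaphore:         # at most max_jobs job bodies run at the same time
        await asyncio.sleep(0.1)  # placeholder for real I/O-bound work
        return job_id

async def main():
    max_jobs = 10                 # in practice, read this from a config file
    semaphore = asyncio.Semaphore(max_jobs)
    results = await asyncio.gather(*(run_job(semaphore, i) for i in range(100)))
    print(len(results), "jobs done")

asyncio.run(main())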
Update:

My question: will asyncio still accept new connections during this block?

If your event loop is blocked by some CPU calculation, asyncio will only start to process new connections later, when the event loop is free again (assuming they haven't timed out by then).
You should always avoid freezing the event loop. An event loop frozen anywhere means that all tasks everywhere in the code are frozen too! Any kind of loop freezing breaks the whole idea of the asynchronous approach, regardless of the number of tasks. Any code where the event loop is frozen will have performance issues.
As you noted, you can use a ProcessPoolExecutor with run_in_executor to await CPU-bound work, but you can also use a ThreadPoolExecutor to avoid freezing the loop.
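For example, a minimal sketch of offloading CPU-bound work with run_in_executor so the loop keeps serving other tasks (fib is just a stand-in for heavy computation):

import asyncio
from concurrent.futures import ProcessPoolExecutor

def fib(n: int) -> int:
    # Deliberately slow, CPU-bound stand-in for real work.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # The event loop stays free to run other tasks while fib runs in another process.
        heavy = loop.run_in_executor(pool, fib, 32)
        other = asyncio.create_task(asyncio.sleep(0.1))  # stands in for other coroutines
        print("fib(32) =", await heavy)
        await other

if __name__ == "__main__":  # guard needed because ProcessPoolExecutor may spawn new processes
    asyncio.run(main())

Swapping ProcessPoolExecutor for ThreadPoolExecutor also keeps the loop from freezing outright, though CPU-bound pure-Python work in a thread will still compete with the loop for the GIL.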

Optimal size of worker pool

I'm building a Go app which uses a "worker pool" of goroutines; initially I start the pool by creating a number of workers. I was wondering what the optimal number of workers would be on a multi-core processor, for example on a CPU with 4 cores? I'm currently using the following approach:
// init pool
numCPUs := runtime.NumCPU()
runtime.GOMAXPROCS(numCPUs + 1) // numCPUs hot threads + one for async tasks.
maxWorkers := numCPUs * 4

// A channel that we can send work requests on.
jobQueue := make(chan job.Job)

module := Module{
    Dispatcher: job.NewWorkerPool(maxWorkers),
    JobQueue:   jobQueue,
    Router:     router,
}
module.Dispatcher.Run(jobQueue)
The complete implementation is under job.NewWorkerPool(maxWorkers) and module.Dispatcher.Run(jobQueue).
My use-case for using a worker pool: I have a service which accepts requests, calls multiple external APIs, and aggregates their results into a single response. Each call can be done independently of the others, as the order of results doesn't matter. I dispatch the calls to the worker pool, where each call is done in an available goroutine in an asynchronous way. My "request" thread keeps listening on the return channels, fetching and aggregating results as soon as a worker thread is done. When all are done, the final aggregated result is returned as a response. Since each external API call may have a variable response time, some calls complete earlier than others. As I understand it, doing this in parallel should perform better than calling each external API synchronously, one after another.
The comments in your sample code suggest you may be conflating the two concepts of GOMAXPROCS and a worker pool. These two concepts are completely distinct in Go.
GOMAXPROCS sets the maximum number of CPU threads the Go runtime will use. This defaults to the number of CPU cores found on the system, and should almost never be changed. The only time I can think of to change this would be if you wanted to explicitly limit a Go program to use fewer than the available CPUs for some reason, then you might set this to 1, for example, even when running on a 4-core CPU. This should only ever matter in rare situations.
TL;DR: Never set runtime.GOMAXPROCS manually.
Worker pools in Go are a set of goroutines, which handle jobs as they arrive. There are different ways of handling worker pools in Go.
What number of workers should you use? There is no objective answer. Probably the only way to know is to benchmark various configurations until you find one that meets your requirements.
As a simple case, suppose your worker pool is doing something very CPU-intensive. In this case, you probably want one worker per CPU.
As a more likely example, though, let's say your workers are doing something more I/O-bound, such as reading HTTP requests or sending email via SMTP. In this case, you may reasonably handle dozens or even thousands of workers per CPU.
And then there's also the question of if you even should use a worker pool. Most problems in Go do not require worker pools at all. I've worked on dozens of production Go programs, and never once used a worker pool in any of them. I've also written many times more one-time-use Go tools, and only used a worker pool maybe once.
And finally, the only way in which GOMAXPROCS and worker pools relate is the same way goroutines in general relate to GOMAXPROCS. From the docs:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
From this simple description, it's easy to see that there can be many more goroutines (potentially hundreds of thousands, or more) than GOMAXPROCS: GOMAXPROCS only limits how many "operating system threads can execute user-level Go code simultaneously", and goroutines which aren't executing user-level Go code at the moment don't count. I/O-bound goroutines (such as those waiting for a network response) aren't executing code. So the theoretical maximum number of goroutines is limited only by your system's available memory.
