I've connected a Lambda to a DynamoDB table via a stream. When a record is written to the table, it triggers the Lambda. The traffic is very bursty, so nothing might happen for a while, then I'll write several thousand records.
What I'm seeing is a few lambda instances will be triggered, but not enough to handle the burst. Then at random times, the number of lambda instances will jump an order of magnitude or two (from 2 to 90 or more), and it will catch up. The problem is the jump might not occur for 30 minutes or more.
I'm seeing the records written to the table very quickly (seconds). The processing of 20 records by the lambda shouldn't take more than 2 minutes. It seems like the lambdas are spending most of their time sitting around waiting for records to show up. The record key for the table is a GUID.
Things I've tried
Playing with the number of records per invocation to make sure there are no Lambda timeouts (20 seems conservative, but 100 causes timeouts)
Moving the lambda to a different subnet
Batching the writes to the table (~500-1000 records in a batch)
Breaking up the writes in hopes it would trigger more lambdas (~20-100 records in a batch)
Increasing the lambda memory to the max (3GB)
Reducing memory while keeping it above what's actually used (1 GB allocated, ~300 MB used)
Is there a better pattern to be using? Should I skip the stream and just write SNS messages? I don't care about order, but would prefer to not run the job more than once.
So here's what I found out.
It looks like the problem is contention on the DynamoDB stream by the lambda instances.
My solution was to skip the DynamoDB stream entirely and post to an SNS topic instead. The Lambdas pick up the messages and scale much better. Times have gone from hours to seconds.
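Roughly, the new write path looks like this (a minimal boto3 sketch; the table name, topic ARN, payload shape, and per-record processing are placeholders):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

TABLE_NAME = "my-table"                                           # placeholder table name
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:record-written"   # placeholder topic ARN

def write_and_notify(record):
    """Writer side: put the item into DynamoDB, then publish it to SNS so the
    worker Lambdas subscribed to the topic fan out immediately."""
    dynamodb.Table(TABLE_NAME).put_item(Item=record)
    sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(record))

def handler(event, context):
    """Worker Lambda: subscribed to the SNS topic, one record per message."""
    for sns_record in event["Records"]:
        record = json.loads(sns_record["Sns"]["Message"])
        process(record)

def process(record):
    pass  # placeholder for the existing per-record logic
```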
Related
Problem: I have a Lambda which produces an array of objects that can be a few thousand elements long (worst case). Each object in this array should be processed by a Step Function.
I am trying to figure out the most scalable and fault-tolerant solution so that every object is processed by the Step Function.
The complete Step Function does not have a long execution time (under 5 minutes), but it has to wait in some steps for other services before continuing (WaitForTaskToken). The Step Function contains a few short-running Lambdas.
These are the possibilities I have at the moment:
1. Naive approach: In my head, a few thousand or even ten thousand concurrent executions are not a big deal, so why can't I just iterate over each element and start an execution directly from the Lambda? (A rough sketch of what I mean is below, after this list.)
2. SQS: The Lambda can put each object into SQS, and another Lambda processes a batch of 10 and starts 10 Step Function executions. Then I could set a max concurrency on the processing Lambda to avoid too many Step Function executions. But there are known issues with such an approach where messages may end up not being processed, and overall this is a lot of overhead, I think.
3. Using a Map state: I could just hand the array to a Map state, which runs the state machine for each object with at most 40 concurrent iterations. But what if the array is longer than 40? Can I just catch the error and, in an error-catch state, retry with the objects that were not processed, until every execution has either succeeded or failed? This also means that if one execution fails, I still want the other 39 to keep running.
4. Split the objects into batches and run them in parallel: Similar to 3, but instead of giving all objects to the Map state at once, another state splits the array into chunks of 40, forwards each chunk to the Map state, and waits until it has finished before processing the next batch. So there is one "main" state which runs for a longer time, plus 40 worker states at the same time.
All of those approaches only take the Step Function execution concurrency into account, not the Lambda concurrency. Since the Step Functions use Lambdas, there are also a lot of concurrent Lambdas running. Could this be an issue? And if so, how can I mitigate it?
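For reference, approach 1 would just be a loop of StartExecution calls from inside the Lambda, roughly like this (the state machine ARN is a placeholder):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:process-object"  # placeholder

def start_all(objects):
    # Approach 1: one StartExecution call per object, straight from the Lambda.
    for obj in objects:
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(obj),
        )
```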
Inline Map States can handle lots of iterations, but only up to 40 concurrently. Iterations over the MaxConcurrency don't cause an error. They will be invoked with a delay.
If your Step Function is only running ~40 concurrent iterations, Lambda concurrency should not be a constraint either.
I just tested a Map state with 1,000 items. Worked just fine. The Quotas page does not mention an upper limit.
In Distributed mode a Map State can handle 10,000 parallel child executions.
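For the inline case, a Map state capped at 40 concurrent iterations looks roughly like this (a boto3/ASL sketch; the ARNs, names, and input shape are assumptions, not your actual setup):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARNs
WORKER_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-item"
ROLE_ARN = "arn:aws:iam::123456789012:role/stepfunctions-role"

# Inline Map state: iterates over $.items, at most 40 iterations at a time.
definition = {
    "StartAt": "ProcessEach",
    "States": {
        "ProcessEach": {
            "Type": "Map",
            "ItemsPath": "$.items",   # expects input like {"items": [ ... ]}
            "MaxConcurrency": 40,     # extra iterations are queued, not failed
            "Iterator": {
                "StartAt": "ProcessItem",
                "States": {
                    "ProcessItem": {"Type": "Task", "Resource": WORKER_LAMBDA_ARN, "End": True}
                },
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="process-items",
    definition=json.dumps(definition),
    roleArn=ROLE_ARN,
)
```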
I am using Kinesis to process events in a micro-service architecture. Events are partitioned at a client project level to ensure all events related to the same project occur in the correct sequence. Currently if there is an error processing one of the events, this can cause the events from other partitions to also become blocked. I had hoped that by increasing the parallelisation factor and bisecting the batch on error, this would allow the other lambda processors to continue processing events from other partitions. This is largely the case, but there are still times when multiple partitions become stuck, presumably because kinesis is sometimes deciding to always allocate several partitions to the same lambda processor.
My question is, is there any way to avoid this in Kinesis, or will I need to start making use of a dead letter queue and removing events that repeatedly fail? The downside to this is that I don't really want to continue processing further events for the same partition once there is a failure, as the state of the micro-service is likely to be corrupt at that point, and I would rather our team manually address whatever issue has occurred before continuing to replay events from the failed partition.
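For reference, the relevant knobs on the event source mapping look roughly like this (a boto3 sketch; the ARNs, function name, and values are placeholders, and the on-failure destination is the DLQ option I'm considering, not something already in place):

```python
import boto3

lam = boto3.client("lambda")

# Placeholder ARNs / names
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/project-events"
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:failed-project-events"

lam.create_event_source_mapping(
    EventSourceArn=STREAM_ARN,
    FunctionName="process-project-events",
    StartingPosition="TRIM_HORIZON",
    ParallelizationFactor=10,          # up to 10 concurrent batches per shard
    BisectBatchOnFunctionError=True,   # split a failing batch to isolate the bad record
    MaximumRetryAttempts=3,            # give up after a few retries...
    DestinationConfig={                # ...and send the failed batch's metadata to a DLQ
        "OnFailure": {"Destination": DLQ_ARN}
    },
)
```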
We are working on parallelising our Kafka consumer to process more records and handle peak load. One thing we are already doing is spinning up as many consumers as there are partitions within the same consumer group.
Our consumer makes an API call which is synchronous as of now. We felt that making this API call asynchronous would let our consumer handle more load, so we are trying to make the call asynchronous and advance the offset in its response callback. However, we are seeing an issue with this:
By making the API call asynchronous, we may get the response for the last record first, while the API calls for earlier records haven't started or completed yet. If we commit the offset as soon as we receive the response for that last record, the committed offset moves to the last record. If the consumer then restarts or the partition rebalances, we will never receive any record before the offset we committed, so we will miss the unprocessed records.
As of now we already have 25 partitions. We would like to know whether anyone has achieved parallelism without increasing the number of partitions, or whether increasing the partitions is the only way (so we can avoid the offset issues above).
First, you need to decouple (if only at first) the reading of the messages from the processing of these messages. Next look at how many concurrent calls you can make to your API as it doesn't make any sense to call it more frequently than the server can handle, asynchronously or not. If the number of concurrent API calls is roughly equal to the number of partitions you have in your topic, then it doesn't make sense to call the API asynchronously.
If the number of partitions is significantly less than the max number of possible concurrent API calls, then you have a few choices. You could try to make the max number of concurrent API calls with fewer threads (one per consumer) by calling the API asynchronously as you suggest, or you can create more threads and make your calls synchronously. Of course, then you get into the problem of how your consumers can hand their work off to a greater number of shared threads, but that's exactly what streaming execution platforms like Flink or Storm do for you. Streaming platforms (like Flink) that offer checkpoint processing can also handle your problem of how to handle offset commits when messages are processed out of order. You could roll your own checkpoint processing and roll your own shared thread management, but you'd have to really want to avoid using a streaming execution platform.
Finally, you might have more consumers than max possible concurrent API calls, but then I'd suggest that you just have fewer consumers and share partitions, not API calling threads.
And, of course, you can always change the number of your topic partitions to make your preferred option above more feasible.
Either way, to answer your specific question you want to look at how Flink does checkpoint processing with Kafka offset commits. To oversimplify (because I don't think you want to roll your own), the Kafka consumers have to remember not only the offsets they just committed, but also the previously committed offsets, and that defines a block of messages flowing through your application. Either that block of messages is processed all the way through in its entirety, or you need to roll back the processing state of each thread to the point where the last message in the previous block was processed. Again, that's a major oversimplification, but that's kinda how it's done.
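To make the simpler version concrete, here is a rough Python sketch of the "decouple reading from processing" idea with a plain thread pool, committing offsets only once a whole polled batch has finished. This is a simplification rather than real checkpointing, and the topic, broker, group, and API call are placeholders (kafka-python):

```python
from concurrent.futures import ThreadPoolExecutor
from kafka import KafkaConsumer  # pip install kafka-python

MAX_CONCURRENT_API_CALLS = 50  # placeholder: whatever the API server can actually handle

def call_api(value):
    """Placeholder for the synchronous per-record API call."""
    pass

consumer = KafkaConsumer(
    "my-topic",                      # placeholder topic
    bootstrap_servers="localhost:9092",
    group_id="my-group",
    enable_auto_commit=False,        # offsets are committed manually, per batch
)

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_API_CALLS) as pool:
    while True:
        # Reading is decoupled from processing: poll a batch, fan it out to threads.
        polled = consumer.poll(timeout_ms=1000)
        records = [r for recs in polled.values() for r in recs]
        if not records:
            continue
        futures = [pool.submit(call_api, r.value) for r in records]
        for f in futures:
            f.result()               # block (and surface errors) before committing
        consumer.commit()            # commit only once the whole batch has succeeded
```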
You have to look at Kafka batch processing. In a nutshell: you can set up a huge batch size with a small number of partitions (or even a single one). Once a whole batch of messages has been consumed on the consumer side (i.e. it's in RAM), you can parallelise those messages in any way you want.
I would really like to share links, but there are too many of them to list here.
UPDATE
In terms of committing offsets, you can do this once for the whole batch.
In general, Kafka doesn't meet its performance targets by piling on partitions, but rather by relying on batch processing.
I have already seen a lot of projects suffering from partition scaling (you may see issues later, during rebalancing for example). The rule of thumb: look at every available batch setting first.
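One caveat: batch.size itself is a producer setting; on the consumer side the analogous knobs are things like max.poll.records and fetch.min.bytes. A rough sketch of a batch-oriented consumer with a single commit per batch, using kafka-python and placeholder names:

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Placeholder topic/broker/group names; the point is the batch-oriented settings.
consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="localhost:9092",
    group_id="my-group",
    enable_auto_commit=False,     # commit manually, once per consumed batch
    max_poll_records=5000,        # return up to 5000 records from a single poll()
    fetch_min_bytes=1_048_576,    # ask the broker to wait for ~1 MB of data...
    fetch_max_wait_ms=500,        # ...or 500 ms, whichever comes first
)

batch = consumer.poll(timeout_ms=1000)   # one big in-memory batch per poll
# ...process the batch with whatever parallelism you like, then:
consumer.commit()                        # a single commit for the whole batch
```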
I have a Lambda which is configured as a consumer of a Kinesis data stream, with a batch size of 10,000 (maximal).
The Lambda parses the given records and inserts them into Aurora PostgreSQL (using an INSERT command).
Somehow, I see that the lambda is invoked most of the time with a relatively small number of records (less than 200), although the 'IteratorAge' is constantly high (about 60 seconds). The records are put in the stream with a random partition key (generated as uuid4), and of size
How can that be explained? As I understand, if the shard is not empty, all the current records, up to the configured batch size, should be polled.
I assume that if the Lambda was invoked with bigger batches this delay could be prevented.
Note: There is also a Kinesis Firehose configured as a consumer (doesn't seem to have any issue).
Turns out the iterator age of the Kinesis stream was 0 ms, so this behavior makes sense.
The iterator age of the Lambda is a bit different:
Measures the age of the last record for each batch of records processed. Age is the difference between the time Lambda received the batch, and the time the last record in the batch was written to the stream.
I'm looking for a creative and efficient way to flatten write bursts to DynamoDB.
I have 4 cron jobs that run every 3 minutes, each on its own thread. For reasons I can't control, they all start at the same time.
Part of each job is to write a few thousand rows to DynamoDB. This normally takes 10 to 30 seconds using batch writes.
Because of the timing, the 4 jobs do the writing in parallel.
I'm looking for the most efficient way to distribute the writes over time.
I don't want to add resources if it's not necessary. The solution probably involves some kind of cache and an additional cron job.
I have memcache available. However, there is probably something more efficient than writing to memcache and reading it back.
Maybe a log file on the server?
What would you do?
It's PHP with Apache on Ubuntu.
An established pattern, especially if you just need the writes to get there eventually, is to put your records into an SQS queue first, and have a background task that reads messages from SQS and puts them into DynamoDB at a maximum prescribed rate - this is useful when you don't want to pay for the high write throughput needed to match your peak periods of writes to the database.
SQS has the benefit of being able to accept messages at almost any scale, and yet you can reduce your DynamoDB costs by writing rows at a low, predictable pace.
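A minimal sketch of that pattern (shown in Python/boto3 rather than PHP; the queue URL, table name, and target write rate are placeholders):

```python
import json
import time
import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pending-writes"  # placeholder
TABLE_NAME = "my-table"                                                        # placeholder
WRITES_PER_SECOND = 100   # the prescribed maximum write rate

def enqueue(rows):
    """Called by the cron jobs: push rows to SQS instead of writing directly."""
    for i in range(0, len(rows), 10):                      # SQS batch limit is 10
        entries = [
            {"Id": str(n), "MessageBody": json.dumps(row)}
            for n, row in enumerate(rows[i:i + 10])
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)

def drain():
    """Separate worker (e.g. its own cron/daemon): drain the queue at a steady pace."""
    table = dynamodb.Table(TABLE_NAME)
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        with table.batch_writer() as batch:
            for msg in messages:
                batch.put_item(Item=json.loads(msg["Body"]))
        for msg in messages:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        time.sleep(len(messages) / WRITES_PER_SECOND)      # throttle to the target rate
```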