What's the best way to provide real-time monitoring of the total count of messages sent to an SQS queue?
I currently have a Grafana dashboard set up to monitor an SQS queue, but it seems to refresh about every two minutes. I'm looking to get something set up to update almost in real-time, e.g. refresh every second.
The queue I'm using consumes around 6,000 messages per minute.
Colleagues of mine have built something for real-time monitoring of uploads to an S3 bucket, using a lambda to populate a PostgreSQL DB and using Grafana to query this.
Is this the best way of achieving this? Is there a more efficient way?
SQS is not event driven - it must be polled. Therefore, there isn't an event each time a message is put into the queue or removed from it. With S3 to Lambda there is an event sent in pretty much real time every time an object has been created or removed.
You can change the polling interval for SQS and poll as fast as you'd like. But be aware that polling does have a cost. The first 1 million requests a month are free.
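For example, a minimal polling sketch in Python with boto3 (the queue URL is a placeholder). At a 1-second interval this is roughly 86,400 requests per day per attribute poll, still well inside the free tier, but worth keeping in mind:

```python
import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

while True:
    # ApproximateNumberOfMessages is, as the name says, approximate.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    print(attrs["Attributes"]["ApproximateNumberOfMessages"])
    time.sleep(1)  # ~1-second refresh
```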
I'm not sure exactly what you're trying to accomplish (I'll come back to that after my suggestions), but there are certainly a couple of ways you could do this. Each has pros and cons.
In every place you produce or consume messages, increment or decrement a CloudWatch metric (or Datadog, Librato, etc.). It's still polling-based, but you could get the granularity down to 15-60 seconds (even with CloudWatch). The biggest problem here is that it's error prone (what happens if an SQS message times out and gets reprocessed?).
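A minimal sketch of that idea, assuming boto3 and a custom namespace/metric name of my own choosing:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_messages(queue_name: str, delta: int) -> None:
    """Publish a custom metric; call with +1 on produce, -1 on consume."""
    cloudwatch.put_metric_data(
        Namespace="MyApp/Queues",  # hypothetical namespace
        MetricData=[{
            "MetricName": "QueueDepthDelta",
            "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
            "Value": delta,
            "Unit": "Count",
        }],
    )
```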
Create a secondary queue. Each message that goes into this queue is either an "add" or a "delete" message. Attach a Lambda, container, or autoscaling group to process the queue and update metrics in an RDS or DynamoDB table, then query the table as needed.
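A rough sketch of that consumer as a Lambda, assuming a hypothetical DynamoDB table queue_metrics with partition key queue_name:

```python
import boto3

table = boto3.resource("dynamodb").Table("queue_metrics")  # hypothetical table

def handler(event, context):
    """Lambda consumer for the secondary 'add'/'delete' queue."""
    for record in event["Records"]:
        delta = 1 if record["body"] == "add" else -1
        # Atomic counter update so concurrent consumers don't race.
        table.update_item(
            Key={"queue_name": "my-queue"},
            UpdateExpression="ADD message_count :d",
            ExpressionAttributeValues={":d": delta},
        )
```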
Use a different queue processing system instead of SQS. I've seen RabbitMQ and Sensu used in very large environments; they will easily handle 6,000 messages per minute.
Keep in mind there are a lot more metrics than just the number of messages in the queue. I've recently become really fond of ApproximateAgeOfOldestMessage, because it indicates whether messages are being processed without error. There's a helpful blog post about the most useful SQS metrics called How to Monitor Amazon SQS with CloudWatch.
Given the following architecture:
The issue with that is that we reach throttling due to the maximum number of concurrent lambda executions (1K per account).
How can this be addressed or circumvented?
We want to have full control of the rate-limiting.
1) Request concurrency increase.
This would probably be the easiest solution, but it would increase the potential workload considerably. It doesn't resolve the root cause, nor does it give us any flexibility or room for custom rate limiting.
2) Rate Limiting API
This would only address one component, as the API is not the only trigger of the step functions. Besides, it would impact the clients, as they would receive a 4xx response.
3) Adding SQS in front of SFN
This will be one of our choices nevertheless, as it is always good to have a queue in front of such a volume of events. However, a simple queue on its own does not provide rate limiting.
As SQS can't be configured to execute SFN directly, a Lambda in between would be required, which then triggers the SFN by code. Without any further logic this would not solve the concurrency issues.
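For reference, a hedged sketch of such a bridge Lambda (the state machine ARN is a placeholder); note that by itself this does nothing about concurrency:

```python
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:my-sfn"  # hypothetical

def handler(event, context):
    # Each SQS message becomes one Step Functions execution;
    # the message body is assumed to be valid JSON input.
    for record in event["Records"]:
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=record["body"],
        )
```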
4) FIFO-SQS in front of SFN
Something along the lines of what this blog post explains.
Summary: by using virtually grouped items we can define the number of items being processed. While this solution works quite well for their use case, I am not convinced it would be a good approach for ours, because the SQS consumer is not the indicator of the workload; it only triggers the step functions.
Due to the uneven workload this is not optimal, as it would be better to have the concurrency distributed by actual workload rather than by chance.
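For reference, a rough sketch of the grouping idea, with a placeholder queue URL: FIFO delivery is ordered per message group, so a fixed set of group IDs caps how many messages can be processed in parallel.

```python
import hashlib
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/events.fifo"  # hypothetical
MAX_PARALLELISM = 10  # at most 10 groups, hence at most 10 messages in flight

def enqueue(event_id: str, payload: dict) -> None:
    # Deterministic hash so the same record always maps to the same group.
    bucket = int(hashlib.sha1(event_id.encode()).hexdigest(), 16) % MAX_PARALLELISM
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(payload),
        MessageGroupId=f"group-{bucket}",
        MessageDeduplicationId=event_id,
    )
```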
5) Kinesis Data Stream
By using a Kinesis data stream with predefined shards and batch sizes we can implement the rate-limiting logic. However, this leaves us with exactly the same issues described in (3).
6) Provisioned Concurrency
Assuming we have an SQS queue in front of the SFN, the SQS consumer can be configured with a fixed provisioned concurrency. The value could be calculated from the account's maximum allowed concurrency in conjunction with the number of parallel tasks of the step functions. It looks like we can find a proper value here.
But once the quota is reached, SQS will still retry sending messages, and once the maximum receive count is reached the message will end up in the DLQ. This blog post explains it quite well.
7) EventSourceMapping toggle via CloudWatch metrics (sort of a circuit breaker)
Assuming we have an SQS queue in front of the SFN and a consumer Lambda.
We could create CloudWatch metrics and trigger the execution of a Lambda once a metric threshold is hit. The event Lambda could then temporarily disable the event source mapping between the SQS queue and the consumer Lambda. Once the workload of the system eases, another event could be sent to enable the source mapping again.
Something like:
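As a hedged sketch, the toggle Lambda could call UpdateEventSourceMapping via boto3 (the mapping UUID is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")
MAPPING_UUID = "00000000-0000-0000-0000-000000000000"  # hypothetical mapping UUID

def handler(event, context):
    # Pause or resume delivery from SQS to the consumer Lambda.
    lambda_client.update_event_source_mapping(
        UUID=MAPPING_UUID,
        Enabled=bool(event.get("enable", False)),
    )
```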
However, I wasn't able to determine proper metrics to react to before the throttling kicks in. Additionally, CloudWatch metrics operate on 1-minute windows, so the event might already come too late.
8) ???
The question itself is a nice overview of all the major options. Well done.
You could implement throttling directly with API Gateway. This is the easiest option if you can afford to reject clients every once in a while.
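As a hedged sketch for a REST API (the API ID and stage name are placeholders), stage-level throttling can be set with boto3's update_stage:

```python
import boto3

apigw = boto3.client("apigateway")

# Apply a rate limit to every method in the stage ("/*/*").
apigw.update_stage(
    restApiId="abc123",   # hypothetical API id
    stageName="prod",     # hypothetical stage
    patchOperations=[
        {"op": "replace", "path": "/*/*/throttling/rateLimit", "value": "100"},
        {"op": "replace", "path": "/*/*/throttling/burstLimit", "value": "200"},
    ],
)
```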
If you need stream and buffer control, go for Kinesis. You can even put all your events in an S3 bucket and trigger Lambdas or a Step Function when a new event has been stored (more here). Yes, you will ingest events differently, and you will need a bridge Lambda function to trigger the Step Function based on Kinesis events, but this is relatively low implementation effort.
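A rough sketch of that bridge Lambda (the state machine ARN is a placeholder); note that Kinesis delivers record data base64-encoded:

```python
import base64
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:my-sfn"  # hypothetical

def handler(event, context):
    for record in event["Records"]:
        # Decode the Kinesis payload and start one execution per record.
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN, input=payload)
```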
We have basically:
DynamoDB Streams =>
trigger Lambda (batch size XX, concurrency 1, retries YY) =>
write to service
There are multiple shards, so we may have some number of concurrent writes to the service. Under some conditions too many streams have too much data, and too many lambda instances are writing to the service, which then responds with 429.
Right now a failure simply ends up as a failure: the Lambda retries, but the service is still overwhelmed.
What we would like to do is have the DynamoDB trigger delay before invoking the Lambda retry, essentially an exponential backoff before triggering. We could easily implement that "inside" the Lambda, retrying and waiting for up to the 15-minute Lambda duration.
But then we are billed for the whole Lambda execution time while it sleeps through however many backoffs are required.
Is there a way to configure the Lambda/DynamoDB trigger to have a delay (that we can control up and down) before invoking the retry? For SQS triggers there is some talk of a redrive policy that can somehow control the rate of retries, but it's not clear how, or whether, that applies to DynamoDB streams.
I understand that the stream will "back up" as we slow down the dispatch of Lambdas, but this is assumed to be a transient situation, and the DynamoDB stream will act as a queue. We can also configure a dead letter queue, but that is somewhat orthogonal to the basic question.
You can configure a wait. And yes, while you are billed for the time used, it's pennies. Seriously, the AWS free tier covers a million Lambda invocations a month. At the enterprise level it's really nothing compared to what EC2 servers cost. But I'm not your CFO, so maybe it is a concern.
You can take your stream, process it into whatever service calls you need, and have their payloads all added to the same SQS queue. You can configure your SQS queue to throttle itself, in effect, so it only sends so many over a given time. The messages in your queue would go to another Lambda that makes the service call for you, one at a time, doled out by SQS.
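A hedged sketch of that forwarding Lambda (the queue URL is a placeholder):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/service-calls"  # hypothetical

def handler(event, context):
    # Re-publish each stream record as an individual SQS message so the
    # downstream consumer can be throttled independently of the stream.
    for record in event["Records"]:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(record["dynamodb"], default=str),
        )
```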
Set up a dead letter queue instead (possibly in combination with either of the above) to catch the failed ones and try again when traffic is lower.
As an aside, you don't want to 'pause' your DynamoDB stream, as it only has a 24-hour TTL on it. If your stream pauses for too long you will lose data. Better to take the stream in whole and put it into an SQS queue as individual writes, because SQS retains messages for up to 14 days.
I have to call an external API that has a limit of a few hundred requests/minute, to process an unknown number of events: last week's events (events I store as DynamoDB objects), calling this API with each of them.
My first idea is to do the following:
Get all the events for a specific day from DynamoDB (but I could get fewer).
Put those events in an SQS queue.
Have SQS events trigger another Lambda, with a reserved concurrency set low enough (let's say to 2), that will request the API (see the sketch after this list).
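For that last step, a minimal sketch of capping the consumer with reserved concurrency via boto3 (the function name is a placeholder):

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the consumer at 2 concurrent executions.
lambda_client.put_function_concurrency(
    FunctionName="api-caller",      # hypothetical function name
    ReservedConcurrentExecutions=2,
)
```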
Since the Lambda has a duration of roughly 100 ms, will I have a maximum of 20 requests/sec here (2 concurrent executions, each running about 10 invocations per second)?
Is my logic correct here?
Thanks.
I think your solution generally makes sense. One of the other things you should be aware of is the VisibilityTimeout on the SQS queue. This basically means:
hide anything that's been read for ${VisibilityTimeout} seconds, before making it visible for processing again
Keep in mind that if you get an error in your Lambda, the queue message will just stay in the queue. For more on that, see this article, which I found helpful.
The other approach you could take, if you still run into throttling issues with your external API, is to set up a CloudWatch event that wakes up every so often (say, every 5 minutes) and explicitly calls your Lambda. You'd need to retrofit your Lambda to explicitly read messages from the queue and then process them. This would give you a little more control to "sip" messages using the receiveMessage method in the SQS SDK.
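A hedged sketch of that scheduled "sip" pattern in Python/boto3, where receive_message is the equivalent of receiveMessage (the queue URL and the processing step are placeholders):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # hypothetical

def call_external_api(body: str) -> None:
    pass  # placeholder for the rate-limited API call

def handler(event, context):
    # Pull at most 10 messages per scheduled wake-up.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=1,
    )
    for msg in resp.get("Messages", []):
        call_external_api(msg["Body"])
        # Delete only after successful processing.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```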
I would like to use queues dynamically generated in ActiveMQ to serialize the handling of events generated by multiple sources.
I need this to be sure that updates on the same record are never in conflict.
The problem is that I need a different queue for each set of updates that relate to the same record.
There could be in theory millions of records and, of course, I do not want to create millions of queues.
Ideally, a queue should be created when necessary and destroyed when all the updates are completed.
The events that fire the updates are asynchronous but correlated. I know that when something happens, several events will be fired at the same time.
It is practically a small burst of asynchronous but correlated updates.
After some time, the generated queue could be deleted.
I understand that there is a cost to creating and deleting queues, but am I right in thinking that creating and deleting these queues at a rate that, during a peak, won't be higher than a few queues per second won't create performance issues?
There is a cost to temporary queues, but it's generally not that high unless you have high network latency between the app server and the broker; you should be fine.
Temporary queues, though, have some limits: they are deleted once the connection that created them goes down. So, if you want your job to resume after a system restart, don't depend on temp queues. I advise against dynamically creating regular queues at a rate of multiple per second; the system is not designed for that.
Generally, what you want to do while processing a group of related messages is to use message groups. That way, you can use a single queue that does not depend on the producer/temp-queue creator connection.
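A hedged sketch of message groups with ActiveMQ over STOMP, assuming the stomp.py client and a local broker (destination name and credentials are placeholders): every update for the same record shares a JMSXGroupID, so the broker delivers the whole group in order to a single consumer while everything flows through one shared queue.

```python
import stomp

conn = stomp.Connection([("localhost", 61613)])  # hypothetical broker address
conn.connect("admin", "admin", wait=True)

def send_update(record_id: str, body: str) -> None:
    conn.send(
        destination="/queue/record.updates",   # one shared queue
        body=body,
        headers={"JMSXGroupID": record_id},    # serializes updates per record
    )
```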
I have some questions regarding the capabilities of Azure Queues, Functions, and workers. I'm not really sure how this works.
Scenario:
q-notifications is a queue in an Azure storage account.
f-process-notification is a function in Azure that is bound to q-notifications. Its job is to get the first message on the queue and process it.
In theory when a message is added to q-notifications, the function f-process-notification should be called.
Questions:
Does the triggered function replace the need for workers? In other words, is f-process-notification called each time a message is placed in the queue?
Suppose I place a message on the queue with a visibility timeout of 5 minutes. Basically I am queueing the message, but it shouldn't be acted on until 5 minutes pass. Does the queue trigger f-process-notification immediately when the message is placed on the queue, or will it only trigger f-process-notification when the message becomes visible, i.e. 5 minutes after it is placed on the queue?
In Azure Functions, each Function App instance running your queue-triggered function has its own listener for the target queue. It monitors the queue for new work using an exponential backoff strategy. When new items are added to the queue, the listener pulls multiple items off the queue (the batching behavior is configurable) and dispatches them in parallel to your function. If your function succeeds, the message is deleted; otherwise it remains on the queue to be reprocessed. To answer your question: yes, we respect any visibility timeout you specify. If a message is added with a 5-minute timeout it will only be processed after that time has passed.
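For illustration, a hedged sketch of such a queue-triggered function using the Python v2 programming model (names and the connection setting are placeholders):

```python
import azure.functions as func

app = func.FunctionApp()

@app.queue_trigger(arg_name="msg",
                   queue_name="q-notifications",
                   connection="AzureWebJobsStorage")
def f_process_notification(msg: func.QueueMessage) -> None:
    # Invoked once per dequeued message; on success the runtime deletes it,
    # otherwise the message returns to the queue for reprocessing.
    body = msg.get_body().decode("utf-8")
    print(f"Processing notification: {body}")
```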
Regarding scale out - when N instances of your Function App are running they will all cooperate in processing the queue. Each queue listener will independently pull batches of messages off the queue to process. In effect, the work will be load balanced across the N instances. Exactly what you want :) Azure Functions is implementing all the complexities of the multiple consumer/worker pattern for you behind the scenes.
I typically use listener logic as opposed to triggers. The consumer(s) constantly monitor the queue for messages. If you have multiple consumers, for example five instances of the consuming code in different Azure worker roles processing the same bus/queue, the first consumer to get the message wins (they are "competing"). This provides a scaling scenario common in an SOA architecture.
This article describes some of the ways to defer processing.
http://markheath.net/post/defer-processing-azure-service-bus-message
Good luck!