SQS/SNS and Architecting for Disposable Computing (EC2 Spot Instances)

I have an application that reads a message from SQS (let's call the queue "p"), does computationally expensive image processing (step #1), uploads the result to S3, deletes the message from queue "p", and then sends a notification to an SNS topic (this SNS topic routes the message to another queue called "q"). Another application reads from queue "q" and does the second stage of the image processing (it downloads the result of step #1 from S3 and performs additional mathematical operations on it).
I have a combination of regular instances and spot instances running the step #1 application.
I know that (because of the SQS visibility timeout) if a spot instance gets shut down during the image processing phase, SQS makes the message visible again to other consumers, so the non-spot EC2 instances will eventually do the work that the spot instances did not manage to complete before the shutdown.
Now my question is: what happens if a spot instance gets shut down exactly after the delete but before the message is sent to SNS? How can we recover from such an event?
# PSEUDO CODE (step #1 worker, current ordering)
msg = read message from queue "p"
result = doWork(msg)               # computationally expensive image processing
upload result to S3
delete msg from queue "p"          # <-- a spot shutdown right after this line loses the notification
publish to SNS topic about result
Cheers!

First of all, process A should not delete the message from its SQS queue until AFTER it has sent the SNS message to kick off the second process. Deleting the message from the queue is the very last thing you should do to signal 'my work is done'. Until the SNS message is sent, the work is not done.
Secondly, one of the key things that you need to embrace when designing processes like this (and especially when using spot instances) is the concept of idempotence: http://en.wikipedia.org/wiki/Idempotence
A unary operation (or function) is idempotent if, whenever it is applied twice to any value, it gives the same result as if it were applied once
Furthermore: http://aws.amazon.com/sqs/faqs/#How_many_times_will_I_receive_each_message
Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies.
What this ultimately means is that, whether or not a spot instance gets shut down mid-process, there is a real possibility that a given message in an SQS queue will be delivered to multiple worker processes simultaneously, or delivered to the same process more than once, either because SQS sent it twice or because the spot instance fails after the SNS message is sent but before the SQS queue is updated.
Without knowing exactly what your processing entails, I couldn't tell you how to make your process idempotent. But don't frame the problem as 'what happens if a spot instance gets shut down mid-stream'; think about 'how do I design each step in the process so that it can be run multiple times, with the same inputs, and not cause any problems'. If you do that, you will kill two birds with one stone.
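To illustrate both points, here is a minimal Python (boto3) sketch of the reordered worker, under the assumption of placeholder names for the queue URL, bucket, topic ARN, and the processing routine. The two ideas it shows are that the delete is the very last step, and that the S3 key is derived deterministically from the message so a repeat run simply overwrites the same object.
import hashlib
import boto3
sqs = boto3.client("sqs")
s3 = boto3.client("s3")
sns = boto3.client("sns")
def do_work(body):
    ...  # placeholder for the expensive image processing (step #1)
def handle_one_message(queue_p_url, bucket, topic_arn):
    resp = sqs.receive_message(QueueUrl=queue_p_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        # Idempotent key: the same input always maps to the same S3 object, so a
        # repeat delivery after a spot interruption just overwrites the earlier result.
        key = "step1/" + hashlib.sha256(msg["Body"].encode()).hexdigest()
        result = do_work(msg["Body"])
        s3.put_object(Bucket=bucket, Key=key, Body=result)
        sns.publish(TopicArn=topic_arn, Message=key)  # signal step #2 first
        # Acknowledge last: if the instance dies before this line, the message becomes
        # visible again and another worker repeats the (idempotent) steps above.
        sqs.delete_message(QueueUrl=queue_p_url, ReceiptHandle=msg["ReceiptHandle"])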

How to control event handling rate with a serverless stack

I have to call an external API that is limited to a few hundred requests per minute, to process an unknown number of events: last week's events (which I store as DynamoDB items), calling the API once for each of them.
My first idea is to do the following:
Get all the events for a specific day from DynamoDB (but I could get fewer)
Put those events in an SQS queue
Have SQS events trigger another Lambda, with a reserved concurrency set low enough (let's say 2), that will call the API.
Since the Lambda has a duration of roughly 100 ms, will I have a maximum of 20 req/sec here (2 concurrent executions × ~10 invocations per second each)?
Is my logic correct here?
Thanks.
I think your solution generally makes sense. One of the other things you should be aware of is the VisibilityTimeout on the SQS queue. This basically means
hide anything that's been read for ${VisibilityTimeout} seconds, before making it visible for processing again
Keep in mind if you get an error in your Lambda, the queue message will just stay in the queue. For more on that, see this article, which I found helpful.
The other approach you could take if you still run into throttling issues with your external API is to set up a CloudWatch event that wakes up every so often (let's say every 5 minutes) and explicitly calls your lambda. You'd need to retrofit your Lambda to explicitly read messages from the queue, and then process them. This would give you a little more control to "sip" messages using the receiveMessage method on the SQS SDK.
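If you do go the scheduled-poll route, a rough Python (boto3) sketch of that Lambda might look like the following; the queue URL, the per-run cap, and call_external_api are placeholders, and the sleep is just a crude way to pace calls under the API's limit.
import time
import boto3
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/events"  # placeholder
MAX_CALLS_PER_RUN = 100  # keep each scheduled run well under the API's per-minute limit
def call_external_api(payload):
    ...  # placeholder for the rate-limited external API call
def handler(event, context):
    calls = 0
    while calls < MAX_CALLS_PER_RUN:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained for now; the next scheduled run picks up the rest
        for msg in messages:
            call_external_api(msg["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            calls += 1
            time.sleep(0.5)  # crude pacing: roughly 2 requests per second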

Real-time monitoring of SQS queue in AWS

What's the best way to provide real-time monitoring of the total count of messages sent to an SQS queue?
I currently have a Grafana dashboard set up to monitor an SQS queue, but it seems to refresh about every two minutes. I'm looking to get something set up to update almost in real-time, e.g. refresh every second.
The queue I'm using consumes around 6,000 messages per minute.
Colleagues of mine have built something for real-time monitoring of uploads to an S3 bucket, using a lambda to populate a PostgreSQL DB and using Grafana to query this.
Is this the best way of achieving this? Is there a more efficient way?
SQS is not event-driven; it must be polled. Therefore, there isn't an event fired each time a message is put into the queue or removed from it. With S3 to Lambda, there is an event sent in pretty much real time every time an object is created or removed.
You can change the polling interval for SQS and poll as fast as you'd like, but be aware that polling does have a cost. The first 1 million requests a month are free.
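For example, a small polling loop against the queue attributes could look like this hedged Python (boto3) sketch; the queue URL is a placeholder and each iteration is a billable request.
import time
import boto3
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/work"  # placeholder
while True:
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"],
    )["Attributes"]
    # Visible vs. in-flight counts; both are approximate by design
    print(attrs["ApproximateNumberOfMessages"], attrs["ApproximateNumberOfMessagesNotVisible"])
    time.sleep(1)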
I'm not sure exactly what you're trying to accomplish, but there are certainly a couple of ways you could do this. Each has positives and negatives.
In every place you produce or consume messages, increment or decrement a CloudWatch metric (or Datadog, Librato, etc.). It's still polling-based, but you can get the granularity down (even with CloudWatch) to 15-60 seconds. The biggest problem here is that it's error prone (what happens if the SQS message times out and gets reprocessed?).
Create a secondary queue. Each message that goes into this queue is either an "add" or a "delete" message. Attach a Lambda, container, or autoscaling group to process the queue and update metrics in an RDS or DynamoDB table. Query the table as needed.
Use a different queue processing system instead of SQS. I've seen RabbitMQ and Sensu used in very large environments; they will easily handle 6,000 messages per minute.
Keep in mind there are a lot more metrics than just the number of messages in the queue. I've recently become really fond of ApproximateAgeOfOldestMessage, because it indicates whether messages are being processed without error. There's a helpful blog post about the most useful SQS metrics called "How to Monitor Amazon SQS with CloudWatch".
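For the first approach (incrementing a custom metric from the producer and consumer), a hedged Python (boto3) sketch might look like this; the namespace and metric name are made up for illustration.
import boto3
cloudwatch = boto3.client("cloudwatch")
def record_queue_event(delta):
    # delta = +1 when a message is produced, -1 when one is successfully consumed
    cloudwatch.put_metric_data(
        Namespace="MyApp/Queue",                  # placeholder namespace
        MetricData=[{
            "MetricName": "MessageBacklogDelta",  # placeholder metric name
            "Value": delta,
            "Unit": "Count",
        }],
    )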

AWS Lambda processing stream from DynamoDB

I'm trying to create a Lambda function that consumes a stream from a DynamoDB table. However, I was wondering what the best practice is for handling data that may not have been processed because of errors during execution. For example, if my Lambda fails and I lose part of the stream, what is the best way to reprocess the lost data?
This is handled for you. DynamoDB Streams, like Kinesis Streams, will resend records until they have been successfully processed. When you are using Lambda to process the stream, that means successfully exiting the function. If there is an error and the function exits unexpectedly, the DynamoDB stream will simply resend the record that was being processed.
The good thing is that you are guaranteed at-least-once processing; however, there are some things you need to look out for. Like Kinesis Streams, DynamoDB Streams are guaranteed to process records in order. As a side effect of this, when a record fails to process, it is retried until it is successfully processed or it expires from the stream (possibly days later) before any records behind it in the stream are processed.
How you solve for this depends on the needs of your application. If you need at-least-once processing but don't need to guarantee that all records are processed in order, I would just drop the records into an SQS queue and do the processing off of the queue. SQS queues will also retry messages that aren't successfully processed; however, unlike DynamoDB and Kinesis Streams, records will not block each other in the queue. If you encounter an error when transferring a record from the DynamoDB Stream to the SQS queue, you can just retry; however, this may introduce duplicates in the SQS queue.
If order is critical or duplicates can't be tolerated, you can use an SQS FIFO queue. SQS FIFO queues are similar to (standard) SQS queues, except that they are guaranteed to deliver messages to the consumer in order and have a deduplication window (5 minutes) within which any duplicates added to the queue will be discarded.
In both cases, when using SQS queues to process messages, you can set up a dead letter queue where messages are automatically sent if they fail to be processed N times.
TLDR: Use SQS Queues.
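As a rough illustration of that hand-off, a stream-triggered Lambda could simply forward each record to SQS. This is a sketch with a placeholder queue URL; note that a thrown error still causes the whole stream batch to be retried, which may duplicate messages in the queue as described above.
import json
import boto3
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/stream-work"  # placeholder
def handler(event, context):
    # Forward each DynamoDB Stream record to SQS for out-of-order processing
    for record in event["Records"]:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(record["dynamodb"]))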
Updating this thread as all the existing answers are stale.
AWS Lambda now supports DLQs for synchronous stream reads from a DynamoDB table stream.
With this feature in mind, here is the flow that I would recommend:
Configure the event source mapping to include the DLQ ARN and set the retry-attempt count. After that many retries, the batch metadata is moved to the DLQ (see the sketch below).
Set up an alarm on DLQ message visibility to get alerted about impacted records.
The DLQ message can be used to retrieve the impacted stream records using the KCL library.
Pro tip: you can use the "Bisect on Function Error" attribute to enable batch splitting. With this option, Lambda is able to narrow down to the impacted record.
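A hedged Python (boto3) sketch of those event source mapping settings; the mapping UUID, the retry count, and the DLQ ARN are placeholders.
import boto3
lambda_client = boto3.client("lambda")
lambda_client.update_event_source_mapping(
    UUID="existing-mapping-uuid",     # placeholder for the stream's event source mapping
    MaximumRetryAttempts=3,           # retries before the batch is given up on
    BisectBatchOnFunctionError=True,  # split the batch to isolate the bad record
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:stream-dlq"}  # placeholder
    },
)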
DynamoDB Streams invokes the Lambda function for each event until it successfully processes it (until the code calls the success callback).
If an error occurs during execution, you need to handle it in code; otherwise the Lambda won't continue with the remaining messages in the stream.
If there is a situation where you need to process a message separately due to an error, you can use a dead letter queue (with Amazon SQS) to push the message to and continue with the remaining items in the stream. You can have separate logic to process the messages in this queue.

AWS SQS - Queue not delivering any messages until Visibility Timeout expires for one message

EDIT: Solved this one while I was writing it up :P -- I love those kinds of solutions. I figured I'd post it anyway; maybe someone else will have the same problem and find my solution. I don't care about points/karma, etc. I had already written the whole thing up, so I figured I'd post it along with the solution.
I have an SQS FIFO queue. It is using a dead letter queue. Here is how it had been configured:
I have a single producer microservice, and I have 10 ECS images that are running as consumers.
It is important that we process the messages close to the time they are delivered in the queue for business reasons.
We're using a fairly recent version of the AWS SDK Golang client package for both producer and consumer code (if important, I can go look up the version, but it is not terribly outdated).
I capture the logs for the producer so I know exactly when messages were put in the queue and what the messages were.
I capture aggregate logs for all the consumers, so I have a full view of all 10 consumers and when messages were received and processed.
Here's what I see under normal conditions looking at the logs:
Message put in the queue at time x
Message received by one of the 10 consumers at time x
Message processed by consumer successfully
Message deleted from queue by consumer at time x + (0-2 seconds)
Repeat ad infinitum for up to about 700 messages / day at various times per day
But the problem I am seeing now is that some messages are not being processed in a timely manner. Occasionally we deliberately fail processing a message because of the state of the system for that message (e.g. maybe a user is still logged in, so it should back off and retry... which it does). The problem is that when a consumer fails a message, it causes the queue to stop delivering any other messages to any other consumers.
"Failure to process a message" here just means the message was received, but the consumer declared it a failure, so we just log an error, and do not proceed to delete it from the queue. Thus, the visibility timeout (here 5m) will expire and it will be re-delivered to another consumer and retried up to 10 times, after which it will go to the dead letter queue.
After delving into the logs and analyzing it, here's what I'm seeing:
Process begins like above (message produced, consumed, deleted).
New message received at time x by consumer
Consumer fails -- logs error and just returns (does not delete)
Same message is received again at time x + 5m (visibility timeout)
Consumer fails -- logs error and just returns (does not delete)
Repeat up to 10x -- message goes to dead-letter queue
New message received but it is now 50 minutes late!
Now all messages that were put in the queue between steps 2-7 are 50 minutes late (5m visibility timeout * 10 retries)
All the docs I've read tell me the queue should not behave this way, but I've verified it several times in our logs. Sadly, we don't have a paid AWS support plan, or I'd file a ticket with them. But just consider the fact that we have 10 separate consumers all reading from the same queue. They only read from this queue; there are no other queues involved.
For de-duplication we are using the automated hash of the message body. Messages are small JSON documents.
My expectation would be if we have a single bad message that causes a visibility timeout, that the queue would still happily deliver any other messages it has available while there are available consumers.
OK, so turns out I missed this little nugget of info about FIFO queues in the documentation:
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html
When you receive a message with a message group ID, no more messages for the same message group ID are returned unless you delete the message or it becomes visible.
I was indeed using the same message group ID for every message. I hadn't given it a second thought. Just be aware: if you do that and any one of your messages fails to process, it will back up all other messages in the queue until the failing message is finally dealt with. The solution for me was to change the message group ID; there is a business-logic ID I can append to it that will work for me.
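For reference, here is a hedged Python (boto3) sketch of sending with a per-job message group ID instead of one shared value; the queue URL and the group-ID scheme are placeholders, and content-based deduplication is assumed to be enabled on the queue as described above.
import boto3
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs.fifo"  # placeholder
def enqueue(job_id, body):
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=body,
        # Distinct group IDs let independent jobs be delivered in parallel; only
        # messages within the same group block behind an unacknowledged message.
        MessageGroupId=f"job-{job_id}",
        # Content-based deduplication is assumed on, so no MessageDeduplicationId is passed.
    )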

Azure Queues - Functions - Message Visibility - Workers?

I have some questions regarding the capabilities regarding Azure Queues, Functions, and Workers. I'm not really sure how this works.
Scenario:
q-notifications is a queue in an Azure storage account.
f-process-notification is a function in Azure that is bound to q-notifications. Its job is to get the first message on the queue and process it.
In theory when a message is added to q-notifications, the function f-process-notification should be called.
Questions:
Does the triggered function replace the need to have workers? In other words, is f-process-notification called each time a message is placed in the queue?
Suppose I place a message on the queue that has a visibility timeout of 5 minutes. Basically I am queueing the message but it shouldn't be acted on until 5 minutes pass. Does the queue trigger f-process-notification immediately when the message is placed on the queue, or will it only trigger f-process-notification when the message becomes visible, i.e. 5 minutes after it is placed on the queue?
In Azure Functions, each Function App instance running your queue-triggered function will have its own listener for the target queue. It monitors the queue for new work using an exponential backoff strategy. When new items are added to the queue, the listener will pull multiple items off of the queue (batching behavior is configurable) and dispatch them in parallel to your function. If your function succeeds, the message is deleted; otherwise it will remain on the queue to be reprocessed. To answer your question: yes, we respect any visibility timeout you specify. If a message is added with a 5-minute timeout, it will only be processed after that.
Regarding scale out - when N instances of your Function App are running they will all cooperate in processing the queue. Each queue listener will independently pull batches of messages off the queue to process. In effect, the work will be load balanced across the N instances. Exactly what you want :) Azure Functions is implementing all the complexities of the multiple consumer/worker pattern for you behind the scenes.
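To illustrate the visibility timeout behaviour from the producer side, here is a hedged Python sketch using the azure-storage-queue SDK; the connection string and message content are placeholders.
from azure.storage.queue import QueueClient
queue = QueueClient.from_connection_string(
    conn_str="<storage-connection-string>",  # placeholder
    queue_name="q-notifications",
)
# The queue-triggered function is only handed this message once the visibility
# timeout has elapsed (~5 minutes after enqueue).
queue.send_message("process notification 42", visibility_timeout=300)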
I typically use listener logic as opposed to triggers. The consumer(s) constantly monitor the queue for messages. If you have multiple consumers, for example 5 instances of the consuming code in different Azure worker roles processing the same bus/queue, the first consumer to get the message wins (they are "competing"). This provides a scaling scenario common in an SOA architecture.
This article describes some of the ways to defer processing.
http://markheath.net/post/defer-processing-azure-service-bus-message
good luck!
