How does Azure Event Grid handle failure when there are multiple subscribers? - azure-eventgrid

The documentation for Event Grid states that it has a delivery and retry mechanism built in, and gives an example of what would classify as a successful or failed attempt. The documentation is very clear about what happens with a single event handler.
My question is, what happens if there are multiple event handlers, and only one handler fails to receive the event? Is the event retried only for that handler, or will all handlers see the retry?

Basically, the Azure Event Grid Pub/Sub model supports two messaging/mediation patterns: the Fan-In pattern and the Fan-Out (broadcasting) pattern.
The logical connection between an event source and an event sink is described by a Subscription, which is basically a metadata artifact of the Pub/Sub model. Each logical connection (represented by a Subscription) is independent of and loosely coupled to the others. In other words, in this Pub/Sub model each subscriber handles exactly one logical connection, i.e. one event source.
Your question relates to the Fan-Out (broadcasting) pattern, where the event is broadcast to multiple subscribers using a push-with-ack delivery mode. Each subscription within this Fan-Out pattern has its own message-delivery state machine, declared by the subscriber: retry options, dead-lettering, filtering, and so on.
In other words, event delivery to the subscribers proceeds in parallel, driven by each subscription, transparently and without any dependencies between them. Note that a subscriber has no information about who, where, or how the event is delivered to the other subscribers; each subscriber sees only its own delivery state. For instance, the value of the Aeg-Delivery-Count header shows the retry counter of that subscription's state machine.
So, if event delivery fails for one of the multiple subscribers, the retry process runs only for that subscriber.
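As a minimal illustration of that per-subscription state, here is a hedged sketch of a webhook event handler that reads the Aeg-Delivery-Count header. Flask and the /events route are assumptions made for the example, and the subscription-validation handshake a real endpoint must answer is omitted for brevity:

```python
from flask import Flask, request  # pip install flask

app = Flask(__name__)

@app.route("/events", methods=["POST"])
def handle_events():
    # Event Grid increments this header per delivery attempt for THIS
    # subscription only; other subscriptions keep their own counters.
    delivery_count = request.headers.get("Aeg-Delivery-Count", "0")
    print(f"delivery attempt for this subscription: {delivery_count}")
    # A 2xx response acknowledges the event; a 5xx (or a timeout) makes
    # Event Grid retry delivery for this subscription alone.
    return "", 200
```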

As Roman explained, each endpoint is handled independently. If one event handler fails, it will be retried without affecting the other event handlers, and of course, if that particular endpoint continues to fail, it will eventually be dead-lettered (assuming dead-lettering has been configured on the event subscription) or dropped.

When it comes to event publishing in Event Grid, events from custom topics or system topics (say, Service Bus namespaces) are forwarded to the event subscriptions configured on them. The events are then sent to the endpoints configured on each event subscription.
Whenever event delivery to an endpoint fails, it is retried according to the configured retry policy. If the number of retries exceeds that policy, the events are stored in a storage account blob if one is configured as the dead-letter destination; otherwise the events are lost.
By default, Event Grid expires all events that aren't delivered within 24 hours. You can customize the retry policy when creating an event subscription. You provide the maximum number of delivery attempts (default is 30) and the event time-to-live (default is 1440 minutes).
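For instance, here is a hedged sketch of customizing both settings with the azure-mgmt-eventgrid Python package; the resource IDs are placeholders, and the exact model and method names can differ between SDK versions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventgrid import EventGridManagementClient
from azure.mgmt.eventgrid.models import (
    EventSubscription,
    RetryPolicy,
    StorageBlobDeadLetterDestination,
    WebHookEventSubscriptionDestination,
)

client = EventGridManagementClient(DefaultAzureCredential(), "<subscription-id>")

subscription = EventSubscription(
    destination=WebHookEventSubscriptionDestination(
        endpoint_url="https://example.com/events"),
    # Give up after 10 attempts or 2 hours, whichever comes first
    # (defaults are 30 attempts and 1440 minutes).
    retry_policy=RetryPolicy(
        max_delivery_attempts=10,
        event_time_to_live_in_minutes=120),
    # Undeliverable events land in this blob container instead of being lost.
    dead_letter_destination=StorageBlobDeadLetterDestination(
        resource_id="<storage-account-resource-id>",
        blob_container_name="deadletters"),
)

client.event_subscriptions.begin_create_or_update(
    scope="<topic-resource-id>",
    event_subscription_name="my-subscription",
    event_subscription_info=subscription,
)
```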
When there are multiple subscribers (event subscriptions) to the same topic, the retry occurs only for the event subscription whose event delivery failed.
Refer to Event Grid message delivery and retry for more info on the retry policy.

Related

Increasing retries in AWS Lambda

I have a few questions about AWS Lambda and couldn't find many details in the documentation:
How can I increase the number of retries in AWS Lambda?
If the maximum number of retries has been reached and the whole Lambda invocation has ultimately failed, how can I get some sort of notification?
Lambda retries depend on many factors. I suggest you take a look at the official docs to understand every single type of retry, but long story short:
Synchronous event sources can either retry or not; it depends on the service.
Asynchronous event sources are retried automatically (by default two retries, i.e. up to three attempts in total). If all attempts fail, you can configure a DLQ to receive the failed messages (see the configuration sketch after this list).
Stream-based, poll-based event sources (like Kinesis or DynamoDB streams) will retry until the configured data retention expires. Be careful: if one message fails and the message itself is a poison pill, it will keep being retried until it expires, and no new messages on that shard will be processed.
Non-stream-based, poll-based event sources (SQS) will discard messages in case of failure (unless it was an invocation failure or a timeout). If discarded, they will be sent to a DLQ if you previously configured one.
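A hedged boto3 sketch of the DLQ wiring mentioned above; the function name and queue ARN are placeholders. Note that since this answer was written, AWS has added an event invoke config that lets you tune (only lower, 0-2) the asynchronous retry attempts:

```python
import boto3

lambda_client = boto3.client("lambda")

# Route events that exhaust all asynchronous retries to an SQS DLQ.
lambda_client.update_function_configuration(
    FunctionName="my-function",
    DeadLetterConfig={"TargetArn": "arn:aws:sqs:us-east-1:123456789012:my-dlq"},
)

# Newer API: cap the built-in asynchronous retries (0-2) and the
# maximum age an event may wait in Lambda's internal queue.
lambda_client.put_function_event_invoke_config(
    FunctionName="my-function",
    MaximumRetryAttempts=2,
    MaximumEventAgeInSeconds=3600,
)
```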
So, based on the information above, we can tackle your question about notifications: you can have another Lambda subscribed to your DLQ to receive the failed message and notify you however you want, either by sending an e-mail from that function itself or by publishing to SNS (which can send an e-mail directly, or do whatever else you want with it).
As for the retry amount, the built-in values were historically not configurable (the event invoke config shown above now lets you lower the asynchronous retries, but not raise them above two). The furthest you can go is to invoke a Lambda function synchronously from your own code and, in case of an exception, retry it as you wish (the exponential backoff, if desired, would also need to be coded manually).
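A sketch of that manual approach, assuming synchronous invocation from your own code (the function name is a placeholder):

```python
import json
import time

import boto3

lambda_client = boto3.client("lambda")

def invoke_with_retries(function_name, payload, max_attempts=5, base_delay=1.0):
    """Invoke a Lambda synchronously, retrying with exponential backoff."""
    for attempt in range(max_attempts):
        response = lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="RequestResponse",
            Payload=json.dumps(payload),
        )
        # "FunctionError" is present in the response when the function raised.
        if "FunctionError" not in response:
            return json.loads(response["Payload"].read())
        time.sleep(base_delay * 2 ** attempt)  # manual exponential backoff
    raise RuntimeError(f"{function_name} failed after {max_attempts} attempts")
```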

Event sourcing - error handling when events are not created

To my understanding, in event sourcing, events are recorded. However, that also means a state change happens first and we record the event afterwards. For example, assume:
1. A client sends a command to a server to "Create user".
2. The server validates the command and creates the user, i.e. stores the new user in a database.
3. The server then logs/stores a Created User event, i.e. event sourcing.
4. The Created User event is propagated to subscribers.
In the scenario above, how do we handle the case where step (2) succeeded but step (3) failed due to, say, a network failure, the database being offline, etc.? The whole system would be in an indeterminate state: a new user was created but the event was never logged. How do we mitigate these types of failures? Or are the steps I've listed above not the way to do event sourcing?
Thanks!
This is not exactly what happens in event sourcing, nor even in plain CQRS.
In event sourcing, after the command is validated, the domain events are generated by the source (the Aggregate in DDD) and then appended to the event store as the first step. After that, the subscribers (read models, projections, Sagas, external systems) receive and process the new domain events.
In CQRS, after the domain events are generated, they are applied to the Aggregate, and then the Aggregate's state and the new events are persisted in the same local transaction as the first step. Only after that do the subscribers receive the events.
So you see, your situation cannot happen: steps 2 and 3 are persisted atomically; they succeed or fail together.
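A minimal sketch of that first step, using SQLite for brevity (the schema and event name are illustrative assumptions): the state change and the event are committed in one local transaction, so they succeed or fail together.

```python
import json
import sqlite3

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS events ("
             "seq INTEGER PRIMARY KEY AUTOINCREMENT, type TEXT, payload TEXT)")

def create_user(user_id, name):
    # One local transaction: the user row (state) and the Created User
    # event commit or roll back together, so steps 2 and 3 cannot diverge.
    with conn:
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)",
                     (user_id, name))
        conn.execute("INSERT INTO events (type, payload) VALUES (?, ?)",
                     ("CreatedUser", json.dumps({"id": user_id, "name": name})))
    # Subscribers are notified only after the commit, e.g. by polling the
    # events table for sequence numbers they have not yet seen.
```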

How to handle side effects based on multiple events in a message driven microservice system?

We are currently working in a message-driven microservice environment, and some of our messages/events are event sourced (using Apache Kafka). Now we are struggling to implement more complex business requirements, where we have to take multiple events into account to create new events and side effects.
In the current situation we are working with devices that can produce errors. We already process them and have a single topic that contains ERROR_OCCURRED and ERROR_RESOLVED events (so they are in order). We also make sure that all messages regarding a specific device always go to the same partition, and both message types share an ID that identifies the specific error incident. We already have a projection that consumes those events and provides an API for our customers, so that they can see all occurred errors and their current state.
Now we have to deal with the following requirement:
Reporting Errors
We need a push system that reports device errors to our external partners, but only after 15 minutes and only if they have not been resolved in that timeframe. Our first approach was to consume all ERROR_RESOLVED events, store their IDs, and have another consumer that handles the ERROR_OCCURRED events in a delayed fashion (e.g. by only consuming the next ERROR_OCCURRED event on the topic once its timestamp is at least 15 minutes old). We would then know whether that particular error has already been resolved and does not need to be reported (since it shares a common ID with the corresponding ERROR_RESOLVED event). Otherwise we send an HTTP request to our external partner and create an ERROR_REPORTED event on a new topic; a sketch of this delayed consumer follows below. Is there any better approach for delayed and conditional message processing?
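A hedged kafka-python sketch of that delayed consumer; the topic name, event shape, and report_to_partner() are assumptions, not part of the actual system:

```python
import json
import time

from kafka import KafkaConsumer  # pip install kafka-python

DELAY_MS = 15 * 60 * 1000
resolved_ids = set()  # incident IDs already seen as ERROR_RESOLVED

def track_resolutions():
    # Consumes the shared topic as fast as possible, remembering resolutions.
    consumer = KafkaConsumer("device-errors",
                             bootstrap_servers="localhost:9092",
                             group_id="resolved-tracker")
    for record in consumer:
        event = json.loads(record.value)
        if event["type"] == "ERROR_RESOLVED":
            resolved_ids.add(event["incidentId"])

def report_unresolved():
    # Lags 15 minutes behind on occurrences; run in a separate thread/process.
    consumer = KafkaConsumer("device-errors",
                             bootstrap_servers="localhost:9092",
                             group_id="error-reporter")
    for record in consumer:
        event = json.loads(record.value)
        if event["type"] != "ERROR_OCCURRED":
            continue
        age_ms = time.time() * 1000 - record.timestamp
        if age_ms < DELAY_MS:
            time.sleep((DELAY_MS - age_ms) / 1000)  # wait out the 15-min window
        if event["incidentId"] not in resolved_ids:
            report_to_partner(event)  # hypothetical HTTP call + ERROR_REPORTED
```

Sleeping inside the poll loop is only tolerable in a sketch; a real consumer would pause the partition or use a scheduler so the group coordinator doesn't consider the instance dead.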
We also have to take the following special use cases into account:
Service restarts: currently we are planning to keep the list of resolved errors in memory, so if a service restarts, that list has to be rebuilt from scratch. We could just replay the ERROR_RESOLVED messages, but that may take some time, and during that time no ERROR_OCCURRED events should be processed, because otherwise we might report errors that were resolved in less than 15 minutes without us being aware of it. Are there any good practices regarding replay vs. "normal" processing?
Scaling: we may increase or decrease the number of instances of our service at any time, so the partition assignment may change at runtime. That should not be a problem if we create a consumer group per service instance for consuming the ERROR_RESOLVED events, so that every instance knows all resolved errors while still only handling the ERROR_OCCURRED events of its assigned partitions (in another consumer group that is shared by all instances). Is there a better approach for handling partition reassignment and internal state?
Thanks in advance!
For side effects, I would record all "side" actions in the event store. In your particular example, when it is time to send a notification, I would issue a SEND_NOTIFICATION command that emits a NOTIFICATION_SENT event. These events would be processed by some worker process that performs the actual HTTP request.
Actually, I would take this even further: since notifications can fail, I would have, say, two events, NOTIFICATION_REQUIRED and NOTIFICATION_SENT, so that we can retry failed notifications.
And finally, your logic would be: "if the error was not resolved within 15 minutes and no notification was sent, send a notification (or just discard it if it missed its timeframe)".
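A hedged sketch of such a worker; the topic, endpoint URL, and event fields are assumptions:

```python
import json

import requests  # assumed HTTP client
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("notifications",
                         bootstrap_servers="localhost:9092",
                         group_id="notification-worker")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for record in consumer:
    event = json.loads(record.value)
    if event["type"] != "NOTIFICATION_REQUIRED":
        continue
    try:
        # Hypothetical partner endpoint; if the POST fails, no
        # NOTIFICATION_SENT is emitted, so the incident remains
        # unconfirmed and a retry pass can pick it up again.
        resp = requests.post("https://partner.example.com/errors",
                             json=event["payload"], timeout=10)
        resp.raise_for_status()
        producer.send("notifications", json.dumps({
            "type": "NOTIFICATION_SENT",
            "incidentId": event["incidentId"],
        }).encode())
    except requests.RequestException:
        pass  # left unconfirmed on purpose; retried later
```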

If nobody needs reliable messaging on transport level, how to implement reliable PubSub on business level?

This question is mostly out of curiosity. I read this article about WS-ReliableMessaging by Marc de Graauw some time ago and agreed that reliable messaging should be applied at the business level whenever possible.
Now, the question is: he clearly explains his approach in a point-to-point fashion. However, I fail to see how you could implement reliable messaging at the business level in a Publish/Subscribe situation.
I will try to demonstrate the difference by showing commands (point-to-point) vs. events (publish/subscribe). Note that these examples are highly simplified.
Command: Transfer(uniqueId, amount, sourceAccount, recipientAccount)
If the account holder sends this transfer, he could wait for the confirmation MoneyTransferred (assuming this event will contain a reference to the uniqueId in the Transfer command).
If the account holder doesn't receive the MoneyTransferred event within a given timeout period, he could send the same command again (of course assuming the command processor is idempotent).
So I see how reliable messaging could work at the business level in a point-to-point fashion.
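A minimal sketch of that retry loop, assuming a hypothetical `bus` client exposing send() and wait_for() (neither is part of any specific library):

```python
import uuid

def transfer_with_retry(bus, amount, source, recipient,
                        timeout=5.0, max_attempts=3):
    """Resend the same idempotent Transfer command until the
    MoneyTransferred confirmation arrives."""
    unique_id = str(uuid.uuid4())  # identical on every attempt -> idempotent
    for _ in range(max_attempts):
        bus.send("Transfer", {"uniqueId": unique_id, "amount": amount,
                              "sourceAccount": source,
                              "recipientAccount": recipient})
        confirmation = bus.wait_for("MoneyTransferred",
                                    correlation_id=unique_id,
                                    timeout=timeout)
        if confirmation is not None:
            return confirmation
    raise TimeoutError("no MoneyTransferred confirmation for " + unique_id)
```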
Now, say the previous command succeeded and produced a MoneyTransferred event. Somewhere in the system we have an event processor (MoneyTransferEmailNotifier) that handles MoneyTransferred events and sends an email notification to the recipient of the transfer.
This MoneyTransferEmailNotifier is subscribed to MoneyTransferred events. But note that the system sending the MoneyTransferred event does not really care who the listeners of this event are, or how many there are. The whole point is the decoupling: I raise an event and don't care whether there are zero or 20 listeners subscribed to it.
At this point, if there is no reliable messaging (minimally at-least-once delivery) provided by the infrastructure, how can we prevent the loss of the MoneyTransferred event? I do want the recipient to get his e-mail notification.
I fail to see how any real 'business-level' solution will resolve this.
(1) One of the solutions I can think of is to explicitly subscribe to events at the 'business level', thereby bypassing any infrastructure component. But aren't we then introducing infrastructure into our business?
(2) The other 'solution' would be to introduce a process manager that does something like this:
PM receives Transfer command
PM forwards Transfer command to the accounts subsystem
If successful, sends command SendEmailNotification(recipient) to the notification subsystem
This does seem to be the solution that DDD prescribes, correct? But doesn't this introduce more coupling?
What do you think?
Edit 2016-04-16
Maybe the root question is a little simpler: if you do not have an infrastructure component that ensures at-least-once or exactly-once delivery, how can you ensure (when you're on an at-most-once infrastructure) that the events you emit will be received?
Not all events need to be delivered, but many are key (like the example of sending the confirmation email).
This MoneyTransferEmailNotifier is subscribed to MoneyTransferred events. But note that the system sending the MoneyTransferred event does not really care who the listeners of this event are, or how many there are. The whole point is the decoupling: I raise an event and don't care whether there are zero or 20 listeners subscribed to it.
Your tangle, I believe, is here: the assumption that only the publish/subscribe middleware can deliver events to where they need to go.
Greg Young covers this in his talk on polyglot data (slides).
Summarizing: the pub/sub middleware is in the way. A pull-based model, where consumers retrieve data from a durable event store, gives you a reliable way to retrieve the messages. So you pull the data from the store, and then use the business-level data to recognize work that has already been done.
For instance, upon retrieving the MoneyTransferred event with its business data, the process manager looks around for an EmailSent event with matching business data. If the second event is found, the process manager knows that at least one copy of the email was successfully delivered, and no more work needs to be done.
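A sketch of that pull-based process manager; the event store interface (read_stream/append) and the event fields are assumptions:

```python
def notify_recipients(store, send_email):
    """Pull MoneyTransferred events and send each email at most effectively once."""
    # Business-level dedup: an EmailSent event with matching business
    # data proves at least one copy of the email was delivered.
    sent = {e["transferId"] for e in store.read_stream("EmailSent")}
    for event in store.read_stream("MoneyTransferred"):
        if event["transferId"] in sent:
            continue  # previous work recognized; nothing more to do
        send_email(event["recipient"], event)
        store.append("EmailSent", {"transferId": event["transferId"]})
```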
The push-based models (pub/sub, UDP multicast) become latency optimizations: the arrival of the pushed message tells the subscriber to pull earlier than it normally would.
In the extreme push case, you pack enough information into the pushed message that the subscriber(s) can act upon it immediately, and you trust that idempotent handling of the message will prevent problems when the redundant copy arrives on the slower channel.
If nobody needs reliable messaging on transport level, how to implement reliable PubSub on business level?
The original article does not state that "nobody needs reliable messaging on transport level"; it states that the ordering of messages should be enforced at the business level because, in some cases, this ordering is an important characteristic of the business.
In any case, Pub/Sub is at the infrastructure level; you can't say that you implement Pub/Sub at the business level. It doesn't make sense.
But then how could you ensure once-only delivery at the business level? By using a Saga/Process manager; one of their important responsibilities is exactly that. You can combine that with idempotent Aggregates. Also, you could identify terms from the Ubiquitous Language that emphasize ordering, like a transaction phase, and include them in your domain models (for example, as properties of the events).
If you do not have an infrastructural component that ensures at-least or exactly-once delivery, how can you ensure (when you're in an at-most-once infrastructure) that your events emitted will be received?
If you do not have at-least-once delivery, then you could rely on the first event, the one that initiates the whole process. I would use event polling and a Saga that ensures every important step in the process is reached at the right moment.
In your case, as sending the email is an important business aspect, I would include it as a step in the process.
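Sketched with the same hypothetical store interface as in the earlier snippet, such a polling Saga just keeps re-checking until every transfer has reached the email step:

```python
import time

def email_saga(store, send_email, poll_interval=5.0):
    """Poll the event store and drive every transfer to the email step."""
    while True:
        sent = {e["transferId"] for e in store.read_stream("EmailSent")}
        for event in store.read_stream("MoneyTransferred"):
            if event["transferId"] not in sent:
                send_email(event["recipient"], event)
                store.append("EmailSent", {"transferId": event["transferId"]})
        time.sleep(poll_interval)  # re-check at the Saga's own pace
```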

nservicebus: events and dead letter queue

Using the Pub/Sub model with NSB, the following two scenarios seemingly cause the dead-letter queue to fill up, eventually resulting in an "Insufficient resources" error.
1) Publishing an event type that has no subscribers
2) Subscriber is offline
For our purposes we are not interested in historical events when the subscriber starts up, so the incoming queue is purged on startup. However, events published while the subscriber is offline fill up the dead-letter queue.
Have I misunderstood commands vs. events? This is the behaviour I was expecting from commands, but I expected events to disappear if nothing was subscribed to them.
When using NServiceBus, events are considered just as important as commands, and thus are subject to the same guarantees regarding durability, delivery, etc.
So, if your subscriber does not care about events while it is offline, it could unsubscribe before shutting down. This way, it's an explicit decision by your subscriber that it does not care about what happens when it's not around to hear it. Just make sure that it doesn't get confused or choke somehow if a few (old) events are lying in its input queue when it comes back online later, because things might get published between the time the unsubscribe message is sent and the time it reaches the publisher.
Another option is to put the [TimeToBeReceived(...)] attribute on your event messages, but that should only be used if it can safely be determined that the event contents lose their relevance after a fixed time for all subscribers.
