An event store could become a single point of failure? - microservices

Since a couple of days I've been trying to figure it out how to inform to the rest of the microservices that a new entity was created in a microservice A that store that entity in a MongoDB.
I want to:
Have low coupling between the microservices
Avoid distributed transactions between microservices like Two Phase Commit (2PC)
At first a message broker like RabbitMQ seems to be a good tool for the job but then I see the problem of commit the new document in MongoDB and publish the message in the broker not being atomic.
Why event sourcing? by eventuate.io:
One way of solving this issue implies make the schema of the documents a bit dirtier by adding a mark that says if the document have been published in the broker and having a scheduled background process that search unpublished documents in MongoDB and publishes those to the broker using confirmations, when the confirmation arrives the document will be marked as published (using at-least-once and idempotency semantics). This solutions is proposed in this and this answers.
Reading an Introduction to Microservices by Chris Richardson I ended up in this great presentation of Developing functional domain models with event sourcing where one of the slides asked:
How to atomically update the database and publish events and publish events without 2PC? (dual write problem).
The answer is simple (on the next slide)
Update the database and publish events
This is a different approach to this one that is based on CQRS a la Greg Young.
The domain repository is responsible for publishing the events, this
would normally be inside a single transaction together with storing
the events in the event store.
I think that delegate the responsabilities of storing and publishing the events to the event store is a good thing because avoids the need of 2PC or a background process.
However, in a certain way it's true that:
If you rely on the event store to publish the events you'd have a
tight coupling to the storage mechanism.
But we could say the same if we adopt a message broker for intecommunicate the microservices.
The thing that worries me more is that the Event Store seems to become a Single Point of Failure.
If we look this example from eventuate.io
we can see that if the event store is down, we can't create accounts or money transfers, losing one of the advantages of microservices. (although the system will continue responding querys).
So, it's correct to affirmate that the Event Store as used in the eventuate example is a Single Point of Failure?

What you are facing is an instance of the Two General's Problem. Basically, you want to have two entities on a network agreeing on something but the network is not fail safe. Leslie Lamport proved that this is impossible.
So no matter how much you add new entities to your network, the message queue being one, you will never have 100% certainty that agreement will be reached. In fact, the opposite takes place: the more entities you add to your distributed system, the less you can be certain that an agreement will eventually be reached.
A practical answer to your case is that 2PC is not that bad if you consider adding even more complexity and single points of failures. If you absolutely do not want a single point of failure and wants to assume that the network is reliable (in other words, that the network itself cannot be a single point of failure), you can try a P2P algorithm such as DHT, but for two peers I bet it reduces to simple 2PC.

We handle this with the Outbox approach in NServiceBus:
http://docs.particular.net/nservicebus/outbox/
This approach requires that the initial trigger for the whole operation came in as a message on the queue but works very well.

You could also create a flag for each entry inside of the event store which tells if this event was already published. Another process could poll the event store for those unpublished events and put them into a message queue or topic. The disadvantage of this approach is that consumers of this queue or topic must be designed to de-duplicate incoming messages because this pattern does only guarantee at-least-once delivery. Another disadvantage could be latency because of the polling frequency. But since we have already entered the eventually consistent area here this might not be such a big concern.

How about if we have two event stores, and whenever a Domain Event is created, it is queued onto both of them. And the event handler on the query side, handles events popped from both the event stores.
Ofcourse every event should be idempotent.
But wouldn’t this solve our problem of the event store being a single point of entry?

Not particularly a mongodb solution but have you considered leveraging the Streams feature introduced in Redis 5 to implement a reliable event store. Take a look this intro here
I find that it has rich set of features like message tailing, message acknowledgement as well as the ability to extract unacknowledged messages easily. This surely helps to implement at least once messaging guarantees. It also support load balancing of messages using "consumer group" concept which can help with scaling the processing part.
Regarding your concern about being the single point of failure, as per the documentation, streams and consumer information can be replicated across nodes and persisted to disk (using regular Redis mechanisms I believe). This helps address the single point of failure issue. I'm currently considering using this for one of my microservices projects.

Related

Is Event sourcing using Database CDC considered good architecture?

When we talk about sourcing events, we have a simple dual write architecture where we can write to database and then write the events to a queue like Kafka. Other downstream systems can read those events and act on/use them accordingly.
But the problem occurs when trying to make both DB and Events in sync as the ordering of these events are required to make sense out of it.
To solve this problem people encourage to use database commit logs as a source of events, and there are tools build around it like Airbnb's Spinal Tap, Redhat's Debezium, Oracle's Golden gate, etc... It solves the problem of consistency, ordering guaranty and all these.
But the problem with using the Database commit log as event source is we are tightly coupling with DB schema. DB schema for a micro-service is exposed, and any breaking changes in DB schema like datatype change or column name change can actually break the downstream systems.
So is using the DB CDC as an event source a good idea?
A talk on this problem and using Debezium for event sourcing
Extending Constantin's answer:
TLDR;
Transaction log tailing/mining should be hidden from others.
It is not strictly an event-stream, as you should not access it directly from other services. It is generally used when transitioning a legacy system gradually to a microservices based. The flow could look like this:
Service A commits a transaction to the DB
A framework or service polls the commit log and maps new commits to Kafka as events
Service B is subscribed to a Kafka stream and consumes events from there, not from the DB
Longer story:
Service B doesn't see that your event is originated from the DB nor it accesses the DB directly. The commit data should be projected into an event. If you change the DB, you should only modify your projection rule to map commits in the new schema to the "old" event format, so consumers must not be changed. (I am not familiar with Debezium, or if it can do this projection).
Your events should be idempotent as publishing an event and committing a transaction
atomically is a problem in a distributed scenario, and tools will guarantee at-least-once-delivery with exactly-once-processing semantics at best, and the exactly-once part is rarer. This is due to an event origin (the transaction log) is not the same as the stream that will be accessed by other services, i.e. it is distributed. And this is still the producer part, the same problem exists with Kafka->consumer channel, but for a different reason. Also, Kafka will not behave like an event store, so what you achieved is a message queue.
I recommend using a dedicated event-store instead if possible, like Greg Young's: https://eventstore.org/. This solves the problem by integrating an event-store and message-broker into a single solution. By storing an event (in JSON) to a stream, you also "publish" it, as consumers are subscribed to this stream. If you want to further decouple the services, you can write projections that map events from one stream to another stream. Your event consuming should be idempotent with this too, but you get an event store that is partitioned by aggregates and is pretty fast to read.
If you want to store the data in the SQL DB too, then listen to these events and insert/update the tables based on them, just do not use your SQL DB as your event store cuz it will be hard to implement it right (failure-proof).
For the ordering part: reading events from one stream will be ordered. Projections that aggregates multiple event streams can only guarantee ordering between events originating from the same stream. It is usually more than enough. (btw you could reorder the messages based on some field on the consumer side if necessary.)
If you are using Event sourcing:
Then the coupling should not exist. The Event store is generic, it doesn't care about the internal state of your Aggregates. You are in the worst case coupled with the internal structure of the Event store itself but this is not specific to a particular Microservice.
If you are not using Event sourcing:
In this case there is a coupling between the internal structure of the Aggregates and the CDC component (that captures the data change and publish the event to an Message queue or similar). In order to limit the effects of this coupling to the Microservice itself, the CDC component should be part of it. In this way when the internal structure of the Aggregates in the Microservice changes then the CDC component is also changed and the outside world doesn't notice. Both changes are deployed at the same time.
So is using the DB CDC as an event source a good idea?
"Is it a good idea?" is a question that is going to depend on your context, the costs and benefits of the different trade offs that you need to make.
That said, it's not an idea that is consistent with the heritage of event sourcing as I learned it.
Event sourcing - the idea that our book of record is a ledger of state changes - has been around a long long time. After all, when we talk about "ledger", we are in fact alluding to those documents written centuries ago that kept track of commerce.
But a lot of the discussion of event sourcing in software is heavily influenced by domain driven design; DDD advocates (among other things) aligning your code concepts with the concepts in the domain you are modeling.
So here's the problem: unless you are in some extreme edge case, your database is probably some general purpose application that you are customizing/configuring to meet your needs. Change data capture is going to be limited by the fact that it is implemented using general purpose mechanisms. So the events that are produced are going to look like general purpose patch documents (here's the diff between before and after).
But if we trying to align our events with our domain concepts (ie, what does this change to our persisted state mean), then patch documents are a step in the wrong direction.
For example, our domain might have multiple "events" that make changes to the same, or very similar, sets of fields in our model. Trying to rediscover the motivation for a change by reverse engineering the diff is kind of a dumb problem to have; especially when we have already fought with the same sort of problem learning user interface design.
In some domains, a general purpose change is good enough. In some contexts, a general purpose change is good enough for now. Horses for courses.
But it's not really the sort of implementation that the "event sourcing" community is talking about.
Besides Constantin Galbenu mentioned CDC component side, you can also do it in event storage side like Kafka stream API.
What is Kafka stream API? Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams to output streams.
After transfer detailed data to abstract data, your DB schema is only bind with the transformation now and can release the tightly relation between DB and subscribers.
If your data schema need to change a lot, maybe you should add a new topic for it.

CQRS - out of order messages

Suppose we have 3 different services producing events, each of them publishing to its own event store.
Each of these services consumes other producers services events.
This because each service has to process another service's events AND to create its own projection. Each of the service runs on multiple instances.
The most straight forward way to do it (for me) was to put "something" in front of each ES which is picking events and publishing (pub/sub) them in queues of every other service.
This is perfect because every service can subscribe to each topics it likes, while the event publisher is doing the job and if a service is unavailable events are still delivered. This seems to me to guarantee high scalability and availability.
My problem is the queue. I can't get an easily scalable queue that guarantees ordering of the messages. It actually guarantees "slightly out of order" with at-least once delivery: to be clear, it's AWS SQS.
So, the ordering problems are:
No order guaranteed across events from the same event stream.
No order guaranteed across events from the same ES.
No order guaranteed across events from different ES (different services).
I though I could solve the first two problems just by keeping track of the "sequence number" of the events coming from the same ES.
This would be done by tracking the last sequence number of each topic from which we are consuming events
This should be easy for reacting to events and also building our projection.
Then, when I pop an event from the queue, if the eventSequenceNumber > previousAppliedEventSequenceNumber + 1 i renqueue it (or make it invisible for a certain time).
But it turns out that using this solution, it will destroy performances when events are produced at high rates (I can use a visibility timeout or other stuff, the result should be the same).
This because when I'm expecting event 10 and I ignore event 11 for a moment, I should ignore also all events (from ES) with sequence numbers coming after that event 11, until event 11 shows up again and it's effectively processed.
Other difficulties were:
where to keep track of the event's sequence number for build the projection.
how to keep track of the event's sequence number for build the projection so that when appling it, I have a consistent lastSequenceNumber.
What I'm missing?
P.S.: for the third problem think at the following scenario. We have a UserService and a CartService. The CartService has a projection where for each user keeps track of the products in the cart. Each cart's projection must have also user's name and other info's that are coming from the UserCreated event published from the UserService. If UserCreated comes after ProductAddedToCart the normal flow requires to throw an exception because the user doesn't exist yet.
What I'm missing?
You are missing flow -- consumers pull messages from sources, rather than having sources push the messages to the consumers.
When I wake up, I check my bookmark to find out which of your messages I read last, and then ask you if there have been any since. If there have, I retrieve them from you in order (think "document message"), also writing down the new bookmarks. Then I go back to sleep.
The primary purpose of push notifications is to interrupt the sleep period (thereby reducing latency).
With SQS acting as a queue, the idea is that you read all of the enqueued messages at once. If there are no gaps, then you can order the collection then start processing them and acking them. If there are gaps, you either wait (leaving the messages in the queue) or you go to the event store to fetch copies of the missing messages.
There's no magic -- if the message pipeline is promising "at least once" delivery, then the consumers must take steps to recognize duplicate messages as they arrive.
If UserCreated comes after ProductAddedToCart the normal flow requires to throw an exception because the user doesn't exist yet.
Review Race Conditions Don't Exist, by Udi Dahan: "A microsecond difference in timing shouldn’t make a difference to core business behaviors."
The basic issue is assuming we can get messages IN ORDER...
This is a fallacy in distributed computing...
I suggest you design for no message ordering in your system.
As for your issues, try and use UTC time in the message body/header created by the originator and try and work around this data point. Sequence numbers are going to fail unless you have a central deterministic sequence creator (which will be a non-scalable, single point of failure).
Using Sagas/State machine is a path that can help to make sense of (business) events ordering.

Remote persistent views with Lagom

In a classical microservice architecture, you have relevant domain events published on some messaging system which allows other parts of the system to react.
Now imagine you have three microservices: Customers, Orders and Recommendation. The Recommendation microservice needs information from Customers and Orders to provide its functionality, such as the list of all customers and all the orders, which is going to be analyzed from some machine learning algorithm. Now, you need to have the state of Customers "join" Orders on the Recommandation microservice:
You have the Recommandation microservice listen to domain events published by Customers and Orders and built its own state. This leads to logic duplication since you probably have that same logic inside Customers and Orders already
On each relevant domain message from Customers and Orders, you just go to them and ask the state of a specific customer or order. This works fine, however if you have N services rather than just one which needs to build a materialized view, you will cause a big load on Customers and Orders
You get Customers and Orders themselves publish "heavy-weight" events (not domain events) that allows any other microservice to build a materialized view without processing domain events. This allows you both a) not to duplicate the logic b) not to keep asking the same information
Has pattern n.3 some drawbacks we couldn't figure out and if not, how do you implement it in Lagom?
I will try to explain a few more bits in the hope to give you some more perspective on that matter and how you can achieve it in a reliable way in Lagom.
We have a few concepts that we must keep in mind. The most important one which is the source of all is Event Sourcing itself. Event Sourcing means that any State in the system has its source in Events.
The first State that we will deal with is the State of the PersistentEntity. This State is prominent because, together with the Command and Event Handler, it defines the consistency boundary of your model.
But there other States in the system. Actually, we can create as much as we want because we have the Event Journal. A read-model is also a State and it’s also generated from the events.
There are many reasons why you shouldn’t publish the State of the PersistentEntity to other systems. The first one being a matter of avoiding coupling. You don’t want your data to leak to other services. That’s all about having an anti-corruption layer (ACL).
So, from here we could say: before publishing Order and Customer to Recommendation Service, I will transform it to OrderView and CustomerView (ACL 101).
The question now is when will you do it? If you try to publish it in Kafka after you have handled a command, you don’t have any guarantee that the State will be published. There are no XA transactions between the event journal and the Kafka topic. So, there is a chance that the events are persisted, but for some reason, the State is not published in Kafka.
If you want data to get out of a service in a reliable way and without creating coupling between services, you have the following options:
Use the broker API and publish the events to a topic. You should not publish the events as they are, but transform them into the format of your external API (ACL).
Use a read-side processor to generate a view of it, again the external API format you want to make available. If you want, you can publish that ViewState to a topic so other services can consume it directly.
That said, there is nothing wrong in publishing something in a topic that is not a real event, but some derived State. The problem is how you can guarantee that it is effectively published. Doing that from inside the PersistentEntity is risky because you have at-most-once semantics. The most reliable way of doing it is a read-side process that gives you at-least-once semantics.
Further comments inline...
Listen to domain events from customer and orders and rebuild the state
in the recommandation service. This is a horrible idea because you
would need to duplicate the logic that handles events across different
bounded context
That's not a horrible idea. That's how you make your services independent from each other. The logic that you will need to implement to consume the events are not the same. As you said, it's a different bounded context, as such it only gets what it needs.
Leaking the State from a BC to another is more problematic for the reasons I mentioned above (anti-corruption layer).
To achieve decoupling you do need more coding and there is nothing wrong with that. At the end of the day, the reason for building microservices is to avoid coupling and be able to let the services evolve and scale without interfering with each other. There is a price to pay for that and the price is to write more code. You need to evaluate the thread-offs.
You can consume your own events, produce an OrderView and CustomerView and publish into Kafka, but that's the same as consuming the events directly on the Recommendation Service.
Note that you also need to store OrderView and CustomerView somewhere in the Recommendation Service. So you end up storing it three times. On the original service (view table), in Kafka and in the Recommendation Services.
That's why publishing events in a topic is the best option to propagate data between services.
Every time we receive a domain event from customers or orders, go to
them and ask them the state. This is horrible because if you have more
than one microservice that needs their state, you will end up
producing load on customers and orders
That is indeed a horrible idea because you will make the Recommendation Service be dependent on the other two services. If Order or Customer is down, the Recommendation will be down as well. That's what a broker helps to solve.
Have customers and orders not only publish events but also state and
having all the services that need to build materialized views listen
the state they need How do you apply the last pattern with Lagom? We
found no way to listen to state changes, just to events. One solution
we considered implied publishing with pubSub the state in the onEvent
handler of a persistent entity but I am not sure this is the right
place to make it happen.
Using pubSub in the onEvent handler is the worst solution of all. For the following reasons:
pubSub has at-most-once sematincs (see comments above)
Event handlers are called many times. Whenever you re-hydrate an Entity, the events are replayed and the the event handlers will be used for that. Which mean that you will re-publish the state each time. Actually, you would solve the at-most-once pubSub problem, but not the way you might expect/desire.
You could use the afterPersist callback for that, but that's not reliable neither because pubSub is at-most-once.
PubSub inside a PersistentEntity should not be used for something that you need to be reliable. It's a best-effort capability, that's all.

If nobody needs reliable messaging on transport level, how to implement reliable PubSub on business level?

This question is mostly out of curiosity. I read this article about WS-ReliableMessaging by Marc de Graauw some time ago and agreed that reliable messaging should be applied on the business level as whenever possible.
Now, the question is, he explains clearly what his approach is in a point-to-point fashion. However, I fail to see how you could implement reliable messaging on the business level in a Publish/Subscribe situation.
I will try to demonstrate the difference by showing commands (point-to-point) vs. events (publish/subscribe). Note that these examples are highly simplified.
Command: Transfer(uniqueId, amount, sourceAccount, recipientAccount)
If the account holder sends this transfer, he could wait for the confirmation MoneyTransferred (assuming this event will contain a reference to the uniqueId in the Transfer command.
If the account holder doesn't received the MoneyTransferred within a given timeout period, he could send the same command again. (of course assuming the command processor is idempotent)
So I see how reliable messaging could work on business level in a point-to-point fashion.
Now, say we the previous command succeeded and produced a MoneyTransferred event. Somewhere in the system we have an event processor (MoneyTransferEmailNotifier) that handles MoneyTransferred events and will send an email notification to the recipient of the transfer.
This MoneyTransferEmailNotifier is subscribed to MoneyTransferred events. But note that system sending the MoneyTransferred event does not really care who or how many listeners there are to this event. The whole point is the decoupling here. I raise an event and don't care if there zero or 20 listeners that subscribe to this event.
At this point, if there is no reliable messaging (minimally at-least-once-delivery) provided by the infrastructure, how can we prevent the loss of the MoneyTransferred event? I do want the recipient to get his e-mail notification.
I fail to see how any real 'business-level' solution will resolve this.
(1) One of the solutions I can think of is by explicitly subscribing to events on 'business level' and thereby bypassing any infrastructure component. But aren't we at that moment introducing infrastructure in our business?
(2) The other 'solution' would be by introducing a process manager that does something like this:
PM receives Transfer command
PM forwards Transfer command to the accounts subsystem
If successful, sends command SendEmailNotification(recipient) to the notification subsystem
This does seem to be the solution that DDD prescribes, correct? But doesn't this introduce more coupling?
What do you think?
Edit 2016-04-16
Maybe the root question is a little bit more simplistic: If you do not have an infrastructural component that ensures at-least or exactly-once delivery, how can you ensure (when you're in an at-most-once infrastructure) that your events emitted will be received?
Not all events need to be delivered but there are many that are key (like the example of sending the confirmation email)
This MoneyTransferEmailNotifier is subscribed to MoneyTransferred events. But note that system sending the MoneyTransferred event does not really care who or how many listeners there are to this event. The whole point is the decoupling here. I raise an event and don't care if there zero or 20 listeners that subscribe to this event.
Your tangle, I believe, is here - that only the publish subscribe middleware can deliver events to where they need to go.
Greg Young covers this in his talk on polyglot data (slides).
Summarizing: the pub/sub middleware is in the way. A pull based model, where consumers retrieve data from the durable event store gives you a reliable way to retrieve the messages from the store. So you pull the data from the store, and then use the business level data to recognize previous work as before.
For instance, upon retrieving the MoneyTransferred event with its business data, the process manager looks around for an EmailSent event with matching business data. If the second event is found, the process manager knows that at least one copy of the email was successfully delivered, and no more work need be done.
The push based models (pub/sub, UDP multicast) become latency optimizations -- the arrival of the push message tells the subscriber to pull earlier than it normally would.
In the extreme push case, you pack into the pushed message enough information that the subscriber(s) can act upon it immediately, and trust that the idempotent handling of the message will prevent problems when the redundant copy of the message arrives on the slower channel.
If nobody needs reliable messaging on transport level, how to implement reliable PubSub on business level?
The original article does not state that "nobody needs reliable messaging on transport level", it states that the ordering of messages should be enforced at the business level because, in some cases, if this ordering is an important characteristic of the business.
In any case, PubSub is at the infrastructure level, you can't say that you implement PubSub at the business level. It doesn't make sense.
But then how you could ensure only-once-delivery at the business level? By using a Saga/Process manager. On of the important responsibilities of them is exactly that. You can combine that with idempotent Aggregates. Also, you could identify terms that emphasis ordering from the Ubiquitous language like transaction phase and include them in your domain models (for example as properties of the events).
If you do not have an infrastructural component that ensures at-least
or exactly-once delivery, how can you ensure (when you're in an
at-most-once infrastructure) that your events emitted will be
received?
If you do not have at-least-once then you could use the first event that it is initiating the hole process. I would use event polling and a Saga that ensure that every important step in the process is reached at the right moment.
In your case, as the sending of the email is an important business aspect, I would include it as a step in the process.

Fault tolerant redundancy

This might result in biased and opinion based answers, if so I'll close the question but...
I have a rather basic requirement of improving our up-time and speed. As part of this I'm looking at the two main competing approaches, traditional pub/sub and akka.net. We don't have any issues currently or expect to have any need for concurrency control.
What we have is several basic workflows which are data analysis, manipulation and persistence of the result:
Step 1) Capture work to be done (IE what objects need to do some work)
Step 2) Execute that work load and produce a result
Step 3) Save result
Using traditional pub/sub This seems rather easy. Have micro services for each step, push a message at the end of each step with the data required (or more to the point data that might be useful) for the next step. Using any off the self message queue/topic/subscription software this provides a nice ability to:
1) geographically spread the loads around the world to where the source data is located
2) increase the number of "workers" that subscribe to increase through put
3) push to something central that can support the idea of connecting "workers" with a minimal learning curve
4) any component (or set of workers for a component) further down the workflow has/have a queue where the messages queue and wait for said component to come back online (even if the whole component disconnects)
5) adding new components that do something new and different, is as easy as registering a new subscription to a topic.
It's all pretty much out of the box easy joy... assuming sensible aggregate and bounded context patterns are adhered to here. I'm not seeking advise of how to write good distributed code, I'm looking for how deploy it, support it, debug rouge/missing/corrupt messages etc. Which is why I want to know what Akka.net offers.
I've seen there's Akka.net clustering . It may or may not be production ready yet, but best I understand what it can/could do for us.
So the main questions I have are:
1) Where are messages stored prior to arriving? So long as a publisher has access to the messaging bus/software endpoint, any such software will store and hold messages waiting for a subscriber to connect and pick up it's messages (obvious assumptions about the subscription having already been registered so the messages queue for it). How does Akka.net cluster handle all of this?
2) What tooling exists for operational support of these queues and mailboxes in Akka.net cluster? What tools give an operator insight into what is in a mailbox received but waiting to be processed and what tools exist for viewing what has been "published" and not yet "received"? Most competing Pub/Sub software has operational tools so I'm looking for some comparison here.
3) How do you debug rouge, missing or corrupt messages. We all know we should trust our software but a bad message can cause a system to spiral out of control, so how would I eject a bad message from the system? How can I modify a message so it's going to behave differently because the business needs something fixed at 3:30 am? How can I answer "where is my message" with "it IS in the system and it IS waiting to be received" or "it has been received and just in the mailbox"?
4) If a component goes down HARD (recycle, hardware failure what ever) what will restore the mailboxes, queues etc? Any message that's actually being processed has an acceptable lost tolerance, but 1000 messages in a mailbox getting lost isn't so tolerable, what persistence and tolerance is there?
5) The light review I've done appears to advocate for a supervisor pattern to be built into your software to marshal messages around (I'm guessing to manage and release concurrency locks?). Given concurrency isn't an issue here, what out of the box pub/sub mechanism do you support that isn't basic message remoting between two (or x internally defined in code) components? Again with subscriptions and topics in most pub/sub software, your first object pushes a message (it's central so it's a potential single point of failure) but that component (and neither doesn't any other code) have to be aware of what will consume that message. It's expansion nirvana compared the old school way where we manually pushed a message from one object to the next (and to the next), rebuilding or recompiling for each new class that same message had to go to. I'm keen to not have to build our own message router.
6) When all instances of a particular component go offline (say step 3 above) what remembers that there's actually something there that needs to queue and remember those messages (say the ones pushed blindly from step 2 above)? In other software, until you delete the subscription the messages keep queuing up based on what ever rules are defined for TTL etc. What is provided for this?

Resources