Making Automatonymous sagas multitenant

We have successfully integrated our multitenancy strategy with MassTransit thanks to some help from Chris Patterson. However, we are stumbling over getting our (Automatonymous) sagas multitenant. I have something that works, but I am not at all comfortable with it. We are using the "schema per tenant" database strategy, but are willing to flex this for sagas if that is the cleanest way to solve it.
We have tenant ID on the header of all messages. We scrape it off the IConsumeContext<> of incoming messages and put it back on the IPublishContext<> of outgoing messages. This works fine with ISagaRepository<>.GetSaga(...) because one of its parameters is IConsumeContext<>. The problem is that the other ISagaRepository<> methods do not take an IConsumeContext<>, so when we call them we have no way of filtering by tenant within the repository. If we stick with our current database strategy, we have to know the tenant so we know what schema to hit. If we change to centralized tenant tables, we have to include tenant in the filtering because the value being correlated on is not necessarily unique across tenants.
The PropertySagaLocator<,> seems to be the key point based on my current understanding. In its Find(IConsumeContext<>) method we have the tenant context we need accessible, but it is not being passed down to the saga repository.
In my current attempt to get this working, I have therefore created a property saga locator for multitenancy that works with a specialized tenant saga repository and gives it the tenant context that it needs to use its .Where(...) method appropriately. But here's where it gets ugly. The PropertySagaLocator<,> concrete class is being instantiated by Automatonymous, and so to swap this out, I have to start at the edge of Automatonymous, at one of the .StateMachineSaga(...) extension methods and swap out concrete classes all the way down to the point where it is integrating with MassTransit by using the PropertySagaLocator<,> since it is a chain of concrete classes instantiating each other all the way down. I am not comfortable with making such a deep cut through Automatonymous, but it seems to me that whether we take the "schema per tenant" strategy or switch it, we are stuck with needing to integrate at this same point.
The other aspect of this is that we need to put tenant ID on outgoing messages when Automatonymous' .Publish(...) notation is employed. The way that I am currently doing this is with a decorator pattern on ServiceBus, and currently the point at which I am injecting the decorated, tenant-specific service bus is when the bus is copied from the consume context to the instance state, i.e. in my overrides of the saga message sink GetHandlers() method.
Does anyone have experience with how to integrate Automatonymous sagas with multitenancy? What we are doing now just seems too invasive and we would like to hit a more natural seam.

I've found another approach that's a lot less invasive, but is more restrictive. Specifically, you cannot use the PropertySagaLocator, i.e. all your correlations have to be Correlation IDs by inheriting from the CorrelatedBy<> interface. Make sure you don't do any StateMachineSagaRepository<>.Correlate(...) calls, because if you do, it will use the property locator, even if you give it the actual Correlation ID.
What that allows me to do is avoid the use of any methods on the saga repository except GetSaga(...), where I have the context I need for our multitenancy strategy. I then throw a NotImplementedByDesignException in the others.
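A rough sketch of that repository shape, written in Python rather than C# and using hypothetical names (`ConsumeContext`, `TenantSagaRepository`, the `tenant-id` header key). It is not MassTransit's `ISagaRepository<>` interface; it only illustrates serving lookups by correlation ID within a tenant and refusing everything else:

```python
from dataclasses import dataclass, field


@dataclass
class ConsumeContext:
    """Hypothetical stand-in for an incoming message context."""
    headers: dict
    correlation_id: str


@dataclass
class TenantSagaRepository:
    """Stores saga instances per tenant; only get_saga is supported."""
    _store: dict = field(default_factory=dict)

    def get_saga(self, context: ConsumeContext) -> dict:
        # The tenant comes from the message header, so the lookup key
        # is (tenant, correlation id) and never crosses tenants.
        tenant = context.headers["tenant-id"]
        key = (tenant, context.correlation_id)
        return self._store.setdefault(key, {"state": "Initial"})

    def where(self, predicate):
        # No consume context here, so the tenant is unknown: refuse by design.
        raise NotImplementedError("query methods are disabled by design")


repo = TenantSagaRepository()
a = repo.get_saga(ConsumeContext({"tenant-id": "acme"}, "42"))
b = repo.get_saga(ConsumeContext({"tenant-id": "globex"}, "42"))
a["state"] = "Running"  # does not affect the other tenant's saga
```

The same correlation ID resolves to two independent instances because the tenant is part of the key.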
That leaves me with only one thing to worry about: how to get tenant ID on the headers of messages going out from .Publish(...) calls. To do this I just subclassed ConsumeContext<> and simultaneously implemented IConsumeContext<> and then overrode the Bus property with new so that I could set the bus on it. I then had a decorator pattern of the service bus that ensures that the bus publishes with tenant header no matter what method you call on it. Then in my saga repository I wrapped the actions I return in a lambda that passes my subclassed consume context along with my tenant-specific decorated bus into the consumer for the saga instead of just the straight consume context. This results in the bus that gets set on the saga state being specific to that tenant, and all outgoing messages then have tenant ID on them.
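The decorator idea can be sketched independently of MassTransit. The classes below (`RecordingBus`, `TenantBus`) and the `tenant-id` header key are hypothetical, and the sketch is in Python rather than C#; the point is only that the wrapper exposes the same publish call and stamps the header on every message:

```python
class RecordingBus:
    """Hypothetical bus that just records what was published."""
    def __init__(self):
        self.sent = []

    def publish(self, message, headers=None):
        self.sent.append((message, dict(headers or {})))


class TenantBus:
    """Decorator: same publish signature, but always adds the tenant header."""
    def __init__(self, inner, tenant_id):
        self._inner = inner
        self._tenant_id = tenant_id

    def publish(self, message, headers=None):
        stamped = dict(headers or {})
        stamped["tenant-id"] = self._tenant_id
        self._inner.publish(message, stamped)


inner = RecordingBus()
bus = TenantBus(inner, "acme")
bus.publish({"type": "OrderAccepted"})  # caller never mentions the tenant
```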


Authorization of commands in Axon

Up until now I have been handling authorization in the CommandHandlers.
An example is I have an aggregate "Team" containing a list of managers (AggregateIdentifier from a User). All command handlers in the Team aggregate then verify the user executing the command is manager of the team.
The userId is injected as metadata in a CommandHandlerInterceptor based on the SecurityContext.
My main concern is that, when I use sagas, maintaining the user context across the commands issued against different aggregates becomes additional overhead. Aside from that, the manager association can expire while the saga is running, causing subsequent commands to fail and leaving an incomplete state that also needs to be handled with some rollback functionality.
Is it better to do the authorization in my controller layer to avoid the additional overhead or should I see it more as good practice to let my CommandHandlers decide whether the command is valid for the aggregate?
Authorization to perform certain operations/commands is something which I'd argue isn't domain specific logic. Instead, it is more a form of cross-cutting concern which you need throughout your application. Thus, placing it in the @CommandHandler annotated method is not ideal in my view. However, placing it close by makes a lot of sense.
You have pointed out you are already using a CommandHandlerInterceptor to populate the Spring SecurityContext, thus I am assuming you are using a CommandDispatchInterceptor to populate the command's MetaData with information when you send a command out. This is a great use of the interceptor logic indeed, so I'd keep that in place. This, however, only sets the information; it doesn't validate it.
To that end, you could build your own Handler Enhancer, which validates security metadata on a command. You could even build a dedicated annotation you'd add next to the @CommandHandler annotation, which describes the required roles. That way, the method still portrays what roles you need for the given command, but the actual validation can be done in this Handler Enhancer for you.
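As a language-neutral illustration of that split (a marker on the handler, validation in a wrapper), here is a small Python sketch. It does not use Axon's API; `requires_role`, `enhance`, and the `roles` metadata key are assumptions made up for the example:

```python
import functools


def requires_role(role):
    """Hypothetical marker: records the required role on the handler."""
    def mark(handler):
        handler.required_role = role
        return handler
    return mark


def enhance(handler):
    """Wraps a handler so the role check runs before the handler itself."""
    required = getattr(handler, "required_role", None)

    @functools.wraps(handler)
    def checked(command, metadata):
        if required is not None and required not in metadata.get("roles", ()):
            raise PermissionError(f"role {required!r} required")
        return handler(command, metadata)
    return checked


@requires_role("team-manager")
def handle_rename_team(command, metadata):
    # Pure domain logic; no security checks in here.
    return f"renamed to {command['name']}"


handler = enhance(handle_rename_team)
```

The handler body stays free of authorization code, while the marker next to it still documents which role the command needs.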
Now, let's circle back to your question:
Is it better to do the authorization in my controller layer to avoid the additional overhead or should I see it more as good practice to let my CommandHandlers decide whether the command is valid for the aggregate?
I think it's fine to do it in the aggregate, potentially making it cleaner through use of a Handler Enhancer. When it comes to your concern about the Saga, I think you should treat that separately. The Saga handles events, facts that something has happened. Ignoring such a fact because somebody who initiated the operations that led to it doesn't have the rights doesn't change the point that it still has happened. In addition, you are indeed not guaranteed anything about the timing of the Saga. Maybe your Saga deals with historical events, meaning it is completely out of scope.
If possible within your system, I would regard any command the Saga wants to publish as being sent by a "system user". The Saga is not something your users (which have specific roles) will directly influence; it is all indirect. The Saga is internal to your system, hence it is the system describing the intent to perform an operation.
That's my two cents on the situation, hope this helps you out @Vincent!

How to handle events processing time between services

Let's say we have two services A and B. B has a relation to A so it needs to know about the existing entities of A.
Service A publishes events every time an entity is created or updated. Service B subscribes to the events published by A and therefore knows about the entities existing in service A.
Problem: The client (UI or other microservices) creates a new entity 'a' and right away creates a new entity 'b' with a reference to 'a'. This is done without much delay, so what happens if service B has not received/handled the event from A before getting the create request with a reference to 'a'?
How should this be handled?
Service B must fail and the client should handle this and possibly do retry.
Service B accepts the entity and expects the relation to be fulfilled over time, once the expected event is received. Service B provides a state for the entity that ensures it cannot be trusted before the relation has been verified.
It is poor design that the client can/has to do these two calls in the same transaction. The design should be different. How?
Other ways?
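To make the second option concrete, here is a minimal Python sketch. `ServiceB`, the `PENDING`/`VERIFIED` status names, and the method names are hypothetical; a real service would persist this state and probably time out entities that never get verified:

```python
class ServiceB:
    def __init__(self):
        self.known_a = set()      # ids learned from service A's events
        self.entities = {}        # b_id -> {"a_id": ..., "status": ...}

    def create_b(self, b_id, a_id):
        # Accept the entity even if the referenced 'a' is not known yet.
        status = "VERIFIED" if a_id in self.known_a else "PENDING"
        self.entities[b_id] = {"a_id": a_id, "status": status}
        return status

    def on_a_created(self, a_id):
        # Event from service A: record it and verify any waiting entities.
        self.known_a.add(a_id)
        for entity in self.entities.values():
            if entity["a_id"] == a_id:
                entity["status"] = "VERIFIED"


service_b = ServiceB()
first = service_b.create_b("b1", "a1")   # request arrives before the event
service_b.on_a_created("a1")             # event arrives later
```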
I know that event platforms like Kafka ensure very fast event transmission, but there will always be a delay, and since this is an asynchronous process there will always be a kind of race condition.
What you're asking about falls under the general category of bridging the gap between Eventual Consistency and good User Experience which is a well-documented challenge with a distributed architecture. You have to choose between availability and consistency; typically you cannot have both.
Your example raises the question as to whether service boundaries are appropriate. It's a common mistake to define microservice boundaries around Entities, but that's an anti-pattern. Microservice boundaries should be consistent with domain boundaries related to the business use case, not how entities are modeled within those boundaries. Here's a good article that discusses decomposition, but the TL;DR; is:
Microservices should be verbs, not nouns.
So, for example, you could have a CreateNewBusinessThing microservice that handles this specific case. But, for now, we'll assume you have good and valid reasons to have the services divided as they are.
The "right" solution in your case depends on the needs of the consuming service/application. If the consumer is an application or User Interface of some sort, responsiveness is required and that becomes your overriding need. If the consumer is another microservice, it may well be that it cares more about getting good "finalized" data rather than being responsive.
In either of those cases, one good option is a facade (aka gateway) service that lives between your client and the highly-dependent services. This service can receive and persist the request, then respond however you'd like. It can give the consumer a 200 - OK response with an endpoint to call back to check status of the request - very responsive. Or, it could receive a URL to use as a webhook when the response is completed from both back-end services, so it could notify the client directly. Or it could publish events of its own (it likely should). Essentially, you can tailor the facade service to provide to as many consumers as needed in the way each consumer wants to talk.
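A minimal sketch of the "persist, respond immediately, let the caller check back" variant, in Python. The `Facade` class, the status names, and the `/requests/<id>` path are assumptions for illustration only; the in-memory dict stands in for real persistence:

```python
import uuid


class Facade:
    def __init__(self):
        self.requests = {}

    def submit(self, payload):
        request_id = str(uuid.uuid4())
        self.requests[request_id] = {"payload": payload, "status": "ACCEPTED"}
        # Respond immediately; the caller polls the status path later.
        return {"request_id": request_id, "status_path": f"/requests/{request_id}"}

    def mark_completed(self, request_id):
        # Called once both back-end services have reported success.
        self.requests[request_id]["status"] = "COMPLETED"

    def status(self, request_id):
        return self.requests[request_id]["status"]


facade = Facade()
ticket = facade.submit({"a": {"name": "first"}, "b": {"name": "second"}})
before = facade.status(ticket["request_id"])
facade.mark_completed(ticket["request_id"])
after = facade.status(ticket["request_id"])
```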
There are other options too. You can look into Task-Based UI, the Saga pattern, or even just Faking It.
I think you would like to leverage the flexibility of a broker and the confirmation of a synchronous call. Both of them can be achieved by this

DDD, Domain Services and Events

To work with domain events, Jimmy Bogard proposed a method for storing events in aggregates.
This, from my point of view, is a very convenient approach. However, what about the case of a domain event in the domain service?
Domain Service should not have a state (stateless). In this case, in theory, the IDispatcher event dispatcher must be injected into the constructor of such a service.
To avoid injecting the event dispatcher into the domain service, are the following alternative approaches correct?
Saving the events of the last operation in the domain service. However, this would violate the statelessness principle for the domain service.
Returning the list of events from the service method based on the results of the operation (as the return value or in another way, depending on the capabilities of the programming language).
Note: that post was written about five years ago. You may want to review his more recent (and more detailed): Life Beyond Distributed Transactions: An Apostate's Implementation
Domain Service should not have a state
Right - and for this reason, it is very suspicious that you would want to assign responsibility for domain events in the domain service.
You might use a domain service to calculate events for the aggregate, but the storage would still belong to the aggregate structure itself. So that would probably look like a function (or, if you prefer, a method on the domain service) that accepts some arguments provided by the aggregate and returns events.
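A small Python sketch of that shape, with made-up names (`FundsTransferred`, `transfer_events`, `Account`): the domain service is a plain function that returns events, and the aggregate is the one that applies and stores them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FundsTransferred:
    source: str
    target: str
    amount: int


def transfer_events(source_id, source_balance, target_id, amount):
    """Stateless domain service: computes events, stores nothing."""
    if amount <= 0 or amount > source_balance:
        return []
    return [FundsTransferred(source_id, target_id, amount)]


@dataclass
class Account:
    account_id: str
    balance: int
    pending_events: list = field(default_factory=list)

    def transfer_to(self, target_id, amount):
        events = transfer_events(self.account_id, self.balance, target_id, amount)
        for event in events:
            self.balance -= event.amount
        # The aggregate, not the service, keeps the events for later dispatch.
        self.pending_events.extend(events)


account = Account("acc-1", 100)
account.transfer_to("acc-2", 30)
```

No dispatcher needs to be injected anywhere in the domain layer; whatever saves the aggregate can read `pending_events` and dispatch them.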

Saga Choreography implementation problems

I am designing and developing a microservice platform based on the specifications of
The entire framework integrates through sockets, thus removing the overhead of multiple HTTP requests (as in most REST APIs).
A service registry host receives the registry of multiple microservice hosts, each microservice is responsible for a domain of the business. Another host we call a router (or API gateway) is responsible for exposing the microservices for consumption by third parties.
We will use the Saga structure (in choreography style) to distribute the requests, so we have some questions:
Should a microservice issue the event in any process manager or should it be passed directly to the next microservice responsible for the chain of events? (the same logic applies to rollback)
Who should know how to build the Saga chain of events? The first microservice that receives a certain work or the router?
If an event needs to pass a very large volume of data to the next Saga event, how is this done in terms of the request structure? Is it divided into multiple Sagas for example (as a result pagination type)?
I think the main point is that in this router and microservice structure, who is responsible for building the Sagas and propagating their events.
The article Patterns for Microservices — Sync vs. Async does a great job defining many of the terms used here and has animated gifs demonstrating sync vs. async and orchestrated vs. choreographed as well as hybrid setups.
I know the OP answered his own question for his use case, but I want to try to address the questions raised a bit more generally in light of the linked article.
Should a microservice issue the event in any process manager or should it be passed directly to the next microservice responsible for the chain of events?
To use a more general term, a process manager is an orchestrator. A concrete implementation of this may involve a stateful actor that orchestrates a workflow, keeping track of the progress in some way. Since a saga is a workflow itself (composed of both forward and compensating actions), it would be the job of the process manager to keep track of the state of the saga until completion (success or failure). This typically involves the actor sending synchronous calls to services and waiting for some result before going to the next step. Parallel operations can of course be introduced, but the point is that this actor dictates the progression of the saga.
This is fundamentally different from the choreography model. With this model there is no central actor keeping track of the state of a saga, but rather the saga progresses implicitly via the events that each step emits. Arguably, this is a more pure case of an event-driven model since there is no coordination.
That said, the challenge with this model is observing the state at any given point in time. With the orchestration model above, in theory, each actor could be queried for the state of the saga. In this choreographed model, we don't have this luxury, so in practice a correlation ID is added to every message corresponding to (in this case) a saga. If the messages are queryable in some way (the event bus supports it or through some other storage means), then the messages corresponding to a saga could be queried and the saga state could be reconstructed (effectively an event-sourced model).
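As a toy illustration of rebuilding saga state from correlated messages, here is a Python sketch. The message shapes, the event type names, and the `STATUS_AFTER` mapping are all assumptions invented for the example:

```python
log = [
    {"correlation_id": "s1", "type": "OrderPlaced"},
    {"correlation_id": "s2", "type": "OrderPlaced"},
    {"correlation_id": "s1", "type": "PaymentTaken"},
    {"correlation_id": "s1", "type": "ShipmentFailed"},
    {"correlation_id": "s1", "type": "PaymentRefunded"},
]

# Assumed mapping from the last seen event type to a saga status.
STATUS_AFTER = {
    "OrderPlaced": "awaiting payment",
    "PaymentTaken": "awaiting shipment",
    "ShipmentFailed": "compensating",
    "PaymentRefunded": "rolled back",
}


def saga_state(messages, correlation_id):
    """Fold the messages of one saga, in log order, into its current status."""
    status = "not started"
    for message in messages:
        if message["correlation_id"] == correlation_id:
            status = STATUS_AFTER[message["type"]]
    return status
```

Calling `saga_state(log, "s1")` walks only the messages tagged `s1`, so no service has to hold the saga's state explicitly.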
Who should know how to build the Saga chain of events? The first microservice that receives a certain work or the router?
This is an interesting question by itself and one that I have been thinking about quite a lot. The easiest and default answer would be to hard-code the saga plans and map them to the incoming message types, e.g. message A triggers plan X, message B triggers plan Y, etc.
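In its simplest form, that mapping is just a lookup table. The message types and step names below are hypothetical placeholders:

```python
SAGA_PLANS = {
    "OrderPlaced": ["reserve_stock", "take_payment", "ship_order"],
    "AccountClosed": ["cancel_subscriptions", "archive_data"],
}


def plan_for(message):
    """Look up the hard-coded plan for a message type; unknown types have none."""
    return SAGA_PLANS.get(message["type"], [])
```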
However, I have been thinking about what a control plane might look like that manages these plans and provides the mechanism for pushing changes dynamically to message handlers and/or orchestrators. The two specific use cases in mind are changes in authorization policies or dynamically adding new steps to a plan.
If an event needs to pass a very large volume of data to the next Saga event, how is this done in terms of the request structure? Is it divided into multiple Sagas for example (as a result pagination type)?
The way I have approached this is to include references to the large data if these are objects such as a file or something. For data that are inherently streams themselves, a parallel channel could be referenced that a consumer could read from once it receives the message. I think the important distinction here is to decouple thinking about the messages driving the workflow from where the data is physically materialized which depends on the data representation.
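A sketch of passing a reference instead of the data itself, in Python. `BlobStore` is a hypothetical in-memory stand-in for a file or object store, and the event fields are invented for the example:

```python
import hashlib


class BlobStore:
    """Hypothetical stand-in for a file or object store."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def make_event(store, saga_id, payload: bytes):
    # The event carries only a reference; the bytes stay in the store.
    return {"saga_id": saga_id, "type": "ReportGenerated", "payload_ref": store.put(payload)}


store = BlobStore()
event = make_event(store, "s1", b"x" * 1_000_000)
```

The consumer of the event calls `store.get(event["payload_ref"])` only if and when it actually needs the data, so the message that drives the workflow stays small.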
For microservices, every microservice should be responsible for its domain business.
Should a microservice issue the event in any process manager or should it be passed directly to the next microservice responsible for the chain of events? (the same logic applies to rollback)
Events are not passed directly to the next microservice. They are published, and every microservice interested in an event subscribes to it.
If there is rollback, you should consider orchestration.
Who should know how to build the Saga chain of events? The first microservice that receives a certain work or the router?
The microservice that publishes the event will certainly know how to build it. There is no chain of events, because every microservice interested in the event will subscribe to it separately.
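A tiny in-process sketch of that publish/subscribe relationship, in Python. The `EventBus` class and the subscriber names are hypothetical, and a real system would use a message broker rather than direct calls:

```python
from collections import defaultdict


class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, data):
        # The publisher does not know who (if anyone) is listening.
        for handler in self._subscribers[event_type]:
            handler(data)


bus = EventBus()
seen = []
bus.subscribe("OrderPlaced", lambda d: seen.append(("billing", d["order_id"])))
bus.subscribe("OrderPlaced", lambda d: seen.append(("shipping", d["order_id"])))
bus.publish("OrderPlaced", {"order_id": 7})
```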
If an event needs to pass a very large volume of data to the next Saga event, how is this done in terms of the request structure? Is it divided into multiple Sagas for example (as a result pagination type)?
Only publish the data others may be interested in, not all of it. In most cases, the data are not large, and a message queue can handle them efficiently.

Remote persistent views with Lagom

In a classical microservice architecture, you have relevant domain events published on some messaging system which allows other parts of the system to react.
Now imagine you have three microservices: Customers, Orders and Recommendation. The Recommendation microservice needs information from Customers and Orders to provide its functionality, such as the list of all customers and all the orders, which is going to be analyzed by some machine learning algorithm. Now, you need to have the state of Customers "join" Orders on the Recommendation microservice:
You have the Recommendation microservice listen to domain events published by Customers and Orders and build its own state. This leads to logic duplication since you probably have that same logic inside Customers and Orders already
On each relevant domain message from Customers and Orders, you just go to them and ask for the state of a specific customer or order. This works fine; however, if you have N services rather than just one which need to build a materialized view, you will cause a big load on Customers and Orders
You have Customers and Orders themselves publish "heavy-weight" events (not domain events) that allow any other microservice to build a materialized view without processing domain events. This allows you both a) not to duplicate the logic and b) not to keep asking for the same information
Does pattern 3 have some drawbacks we couldn't figure out, and if not, how do you implement it in Lagom?
I will try to explain a few more bits in the hope to give you some more perspective on that matter and how you can achieve it in a reliable way in Lagom.
We have a few concepts that we must keep in mind. The most important one which is the source of all is Event Sourcing itself. Event Sourcing means that any State in the system has its source in Events.
The first State that we will deal with is the State of the PersistentEntity. This State is prominent because, together with the Command and Event Handler, it defines the consistency boundary of your model.
But there are other States in the system. Actually, we can create as many as we want because we have the Event Journal. A read-model is also a State and it’s also generated from the events.
There are many reasons why you shouldn’t publish the State of the PersistentEntity to other systems. The first one being a matter of avoiding coupling. You don’t want your data to leak to other services. That’s all about having an anti-corruption layer (ACL).
So, from here we could say: before publishing Order and Customer to Recommendation Service, I will transform it to OrderView and CustomerView (ACL 101).
The question now is when will you do it? If you try to publish it in Kafka after you have handled a command, you don’t have any guarantee that the State will be published. There are no XA transactions between the event journal and the Kafka topic. So, there is a chance that the events are persisted, but for some reason, the State is not published in Kafka.
If you want data to get out of a service in a reliable way and without creating coupling between services, you have the following options:
Use the broker API and publish the events to a topic. You should not publish the events as they are, but transform them into the format of your external API (ACL).
Use a read-side processor to generate a view of it, again the external API format you want to make available. If you want, you can publish that ViewState to a topic so other services can consume it directly.
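The transformation step common to both options can be sketched in a few lines of Python. This is not Lagom code; `OrderPlaced`, `OrderView`, and the fields on them (including `internal_risk_score`) are invented to show an internal event being mapped to an external format before it leaves the service:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class OrderPlaced:
    """Internal event: includes fields other services should not see."""
    order_id: str
    customer_id: str
    total_cents: int
    internal_risk_score: float


@dataclass(frozen=True)
class OrderView:
    """External API format published on the topic."""
    order_id: str
    customer_id: str
    total: str


def to_external(event: OrderPlaced) -> OrderView:
    # Only the fields that belong to the public contract are copied over.
    total = f"{event.total_cents // 100}.{event.total_cents % 100:02d}"
    return OrderView(event.order_id, event.customer_id, total)


message = asdict(to_external(OrderPlaced("o-1", "c-9", 1999, 0.87)))
```

Because consumers only ever see `OrderView`, the internal event can change shape without breaking them, as long as `to_external` keeps producing the same contract.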
That said, there is nothing wrong in publishing something in a topic that is not a real event, but some derived State. The problem is how you can guarantee that it is effectively published. Doing that from inside the PersistentEntity is risky because you have at-most-once semantics. The most reliable way of doing it is a read-side processor that gives you at-least-once semantics.
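To illustrate what at-least-once means in practice, here is a simplified Python model of a processor that publishes first and saves its offset afterwards. It is not Lagom's read-side API; `ReadSideProcessor` and the simulated crash are assumptions for the sketch:

```python
class ReadSideProcessor:
    def __init__(self, journal, publish):
        self.journal = journal
        self.publish = publish
        self.offset = 0  # stored durably in a real system

    def run(self, crash_before_commit_at=None):
        while self.offset < len(self.journal):
            self.publish(self.offset, self.journal[self.offset])
            if self.offset == crash_before_commit_at:
                return  # simulated crash: published, but offset not saved
            self.offset += 1


journal = ["OrderPlaced", "OrderPaid", "OrderShipped"]
delivered = []
processor = ReadSideProcessor(journal, lambda i, e: delivered.append((i, e)))
processor.run(crash_before_commit_at=1)  # crashes after publishing index 1
processor.run()                           # restart resumes from saved offset
```

After the restart, the event at index 1 is delivered a second time and nothing is lost. Consumers therefore need to tolerate duplicates, for example by ignoring an index they have already handled.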
Further comments inline...
Listen to domain events from customer and orders and rebuild the state in the recommendation service. This is a horrible idea because you would need to duplicate the logic that handles events across different bounded contexts
That's not a horrible idea. That's how you make your services independent from each other. The logic that you will need to implement to consume the events is not the same. As you said, it's a different bounded context; as such, it only gets what it needs.
Leaking the State from a BC to another is more problematic for the reasons I mentioned above (anti-corruption layer).
To achieve decoupling you do need more coding and there is nothing wrong with that. At the end of the day, the reason for building microservices is to avoid coupling and be able to let the services evolve and scale without interfering with each other. There is a price to pay for that and the price is to write more code. You need to evaluate the trade-offs.
You can consume your own events, produce an OrderView and CustomerView and publish into Kafka, but that's the same as consuming the events directly on the Recommendation Service.
Note that you also need to store OrderView and CustomerView somewhere in the Recommendation Service. So you end up storing it three times: in the original service (view table), in Kafka and in the Recommendation Service.
That's why publishing events in a topic is the best option to propagate data between services.
Every time we receive a domain event from customers or orders, go to them and ask them the state. This is horrible because if you have more than one microservice that needs their state, you will end up producing load on customers and orders
That is indeed a horrible idea because you will make the Recommendation Service be dependent on the other two services. If Order or Customer is down, the Recommendation will be down as well. That's what a broker helps to solve.
Have customers and orders not only publish events but also state, and have all the services that need to build materialized views listen to the state they need.
How do you apply the last pattern with Lagom? We found no way to listen to state changes, just to events. One solution we considered implied publishing with pubSub the state in the onEvent handler of a persistent entity but I am not sure this is the right place to make it happen.
Using pubSub in the onEvent handler is the worst solution of all. For the following reasons:
pubSub has at-most-once semantics (see comments above)
Event handlers are called many times. Whenever you re-hydrate an Entity, the events are replayed and the event handlers are used for that, which means that you will re-publish the state each time. Actually, you would solve the at-most-once pubSub problem, but not the way you might expect/desire.
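The replay problem is easy to reproduce in a toy model. The Python below is not Lagom code; `CounterEntity`, `rehydrate`, and the `published` list are stand-ins showing that a side effect placed in an event handler runs again on every reload:

```python
published = []


class CounterEntity:
    def __init__(self):
        self.value = 0

    def on_event(self, event):
        self.value += event["delta"]
        # Side effect inside the event handler: runs on every replay too.
        published.append(self.value)


def rehydrate(journal):
    """Rebuild the entity by replaying all of its events through the handler."""
    entity = CounterEntity()
    for event in journal:
        entity.on_event(event)
    return entity


journal = [{"delta": 1}, {"delta": 2}]
rehydrate(journal)   # first load
rehydrate(journal)   # entity passivated and loaded again
```

Two events were persisted, yet four state publications happened, because each reload replayed the whole journal through the handler.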
You could use the afterPersist callback for that, but that's not reliable either because pubSub is at-most-once.
PubSub inside a PersistentEntity should not be used for something that you need to be reliable. It's a best-effort capability, that's all.
