Is it possible to define a single saga which will process many messages? - MassTransit

My team is considering whether we can use MassTransit as our primary solution for sagas over RabbitMQ (vs. NServiceBus). I admit that our experience with solutions like MassTransit and NServiceBus is minimal, and we have only just started to introduce messaging into our system, so I am sorry if my question is simple or even stupid.
However, when I reviewed the MassTransit documentation I could not tell whether it is possible to solve one of our cases.
The case looks like:
One of our components will produce up to 100 messages which will be sent to a queue. These messages are the result of a single operation in the system. All of the messages will have the same correlation id and the same internal publication id.
1) Is it possible to define a single saga instance (by correlation id) which will wait until it receives all of the messages from the queue and then process them as a single batch?
2) Otherwise, is there any solution to ensure that all of the sent messages were processed (batch consistency)? I assume that the correlation id will serve as the way to find the existing saga instance (a singleton). Ideally, I would like to complete the saga instance once the system has processed every message which belongs to a single group (to one publication).
I looked at CompositeEvent too, but I am not sure whether I could use it to ensure that every message was processed before completing the saga for a specific correlation id.
Can you explain how this could be achieved, and which mechanism I should look at in order to correlate many messages with the same id to a single saga and then complete it once all of the messages have been consumed?
Thank you in advance for any response

What you describe is how correlation by id works. It is like that out of the box.
So, in short - when you configure correlation for your messages correctly, all messages with the same correlation id will be handled by the same saga instance.
Concerning the second question - unless you publish a separate event that informs the saga how many messages it should expect, how would it know that? You can certainly schedule a long timeout and assume that all the messages will be received by the saga within that window, but it's not reliable.
Composite events won't help here, since they are for handling messages of different types as one once all of them have arrived, and they don't count the number of messages of each type. A composite event just waits for one message of each type.
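To make the counting idea concrete, here is a minimal sketch of the state such a correlated saga instance could keep. It is written in plain Java purely for illustration (MassTransit itself is a .NET library, and every type and method name below is invented): the producer publishes one extra event carrying the expected message count, and the instance completes once the number of received messages reaches it.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical saga keyed by correlation id: the producer announces how many messages
// belong to the batch, the saga counts arrivals and completes when the counts match.
class PublicationSaga {

    static class State {
        int expected = -1;   // unknown until the "batch announced" event arrives
        int received = 0;

        boolean isComplete() {
            return expected >= 0 && received >= expected;
        }
    }

    private final ConcurrentHashMap<UUID, State> instances = new ConcurrentHashMap<>();

    // Handler for the extra event that tells the saga how many messages to expect.
    void onBatchAnnounced(UUID correlationId, int expectedCount) {
        instances.computeIfAbsent(correlationId, id -> new State()).expected = expectedCount;
        tryComplete(correlationId);
    }

    // Handler for each of the (up to 100) messages produced by the operation.
    void onBatchMessage(UUID correlationId) {
        instances.computeIfAbsent(correlationId, id -> new State()).received++;
        tryComplete(correlationId);
    }

    private void tryComplete(UUID correlationId) {
        State state = instances.get(correlationId);
        if (state != null && state.isComplete()) {
            instances.remove(correlationId);   // the saga instance is finished
            System.out.println("Batch " + correlationId + " fully processed");
        }
    }
}
```

In a real saga the state would live in the saga repository rather than an in-memory map, and the framework would locate the instance by correlation id for you.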

The ability to receive a series of messages and then operate on them in a batch is a common case, so much so that there is a sample showing how to do just that:
Batch Sample
Each saga instance has a unique correlation identifier, and as long as those messages can be correlated to that single instance, MassTransit will manage the concurrency (either optimistic or pessimistic, and depending upon the saga storage engine).
I'd suggest reviewing the state machine in the sample, and seeing how that compares to your scenario.

Related

CQRS - out of order messages

Suppose we have 3 different services producing events, each of them publishing to its own event store.
Each of these services consumes the other services' events.
This is because each service has to process the other services' events AND create its own projection. Each service runs on multiple instances.
The most straightforward way to do it (for me) was to put "something" in front of each event store (ES) which picks up events and publishes them (pub/sub) to queues for every other service.
This is great because every service can subscribe to whichever topics it likes, the event publisher does the work, and if a service is unavailable events are still delivered. This seems to guarantee high scalability and availability.
My problem is the queue. I can't get an easily scalable queue that guarantees ordering of the messages. It actually guarantees "slightly out of order" delivery with at-least-once semantics: to be clear, it's AWS SQS.
So, the ordering problems are:
No order guaranteed across events from the same event stream.
No order guaranteed across events from the same ES.
No order guaranteed across events from different ES (different services).
I thought I could solve the first two problems just by keeping track of the "sequence number" of the events coming from the same ES.
This would be done by tracking the last sequence number of each topic from which we are consuming events.
This should be easy both for reacting to events and for building our projection.
Then, when I pop an event from the queue, if eventSequenceNumber > previousAppliedEventSequenceNumber + 1, I re-enqueue it (or make it invisible for a certain time).
But it turns out that this solution destroys performance when events are produced at high rates (I can use a visibility timeout or other mechanisms; the result should be the same).
This is because when I'm expecting event 10 and set event 11 aside for a moment, I also have to set aside all events (from that ES) with sequence numbers after event 11, until event 11 shows up again and is effectively processed.
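For illustration only, here is a hedged sketch of the re-enqueue/visibility-timeout approach described above, using the AWS SDK for Java v2; the queue URL, the way the sequence number is carried, and the helper methods are assumptions. It also shows why throughput suffers: every out-of-order message is simply pushed back.

```java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.ChangeMessageVisibilityRequest;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

// Sketch: skip (hide) any event whose sequence number is not the next expected one,
// by pushing its visibility timeout forward instead of processing it.
class SequenceGatedConsumer {
    private final SqsClient sqs = SqsClient.create();
    private final String queueUrl = "https://sqs.example/queue"; // placeholder
    private long lastApplied = 0;                                // last applied sequence number for this stream

    void poll() {
        ReceiveMessageRequest receive = ReceiveMessageRequest.builder()
                .queueUrl(queueUrl).maxNumberOfMessages(10).build();
        for (Message m : sqs.receiveMessage(receive).messages()) {
            long seq = extractSequenceNumber(m); // assumed to be carried in the message body
            if (seq == lastApplied + 1) {
                apply(m);
                lastApplied = seq;
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl).receiptHandle(m.receiptHandle()).build());
            } else {
                // Out of order: make it invisible for a while so it comes back later.
                sqs.changeMessageVisibility(ChangeMessageVisibilityRequest.builder()
                        .queueUrl(queueUrl).receiptHandle(m.receiptHandle())
                        .visibilityTimeout(30).build());
            }
        }
    }

    private long extractSequenceNumber(Message m) { return Long.parseLong(m.body()); } // placeholder parsing
    private void apply(Message m) { /* update the projection */ }
}
```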
Other difficulties were:
where to keep track of the event's sequence number for building the projection.
how to keep track of the event's sequence number for building the projection so that, when applying it, I have a consistent lastSequenceNumber.
What am I missing?
P.S.: for the third problem, consider the following scenario. We have a UserService and a CartService. The CartService has a projection where, for each user, it keeps track of the products in the cart. Each cart projection must also include the user's name and other info coming from the UserCreated event published by the UserService. If UserCreated arrives after ProductAddedToCart, the normal flow requires throwing an exception because the user doesn't exist yet.
What am I missing?
You are missing flow -- consumers pull messages from sources, rather than having sources push the messages to the consumers.
When I wake up, I check my bookmark to find out which of your messages I read last, and then ask you if there have been any since. If there have, I retrieve them from you in order (think "document message"), also writing down the new bookmarks. Then I go back to sleep.
The primary purpose of push notifications is to interrupt the sleep period (thereby reducing latency).
With SQS acting as a queue, the idea is that you read all of the enqueued messages at once. If there are no gaps, you can order the collection, then start processing and acking the messages. If there are gaps, you either wait (leaving the messages in the queue) or you go to the event store to fetch copies of the missing messages.
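A rough sketch of that suggestion, with invented names and a placeholder event-store client: read what is available, sort it, apply the contiguous prefix, and fetch anything missing directly from the event store.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the suggestion above: read the available messages as a batch, sort them,
// apply the contiguous prefix, and fetch any missing events from the event store.
class GapFillingReader {
    private long lastApplied = 0;

    void drain(List<Event> received, EventStoreClient eventStore) {
        List<Event> batch = new ArrayList<>(received);
        batch.sort(Comparator.comparingLong(Event::sequenceNumber));
        for (Event e : batch) {
            while (e.sequenceNumber() > lastApplied + 1) {
                // Gap detected: either wait for redelivery or read the missing event directly.
                Event missing = eventStore.load(lastApplied + 1);   // assumed API
                apply(missing);
            }
            if (e.sequenceNumber() == lastApplied + 1) {
                apply(e);                 // next expected event
            }                             // duplicates (<= lastApplied) are simply dropped
        }
    }

    private void apply(Event e) { lastApplied = e.sequenceNumber(); /* update projection */ }

    record Event(long sequenceNumber, String payload) {}
    interface EventStoreClient { Event load(long sequenceNumber); }
}
```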
There's no magic -- if the message pipeline is promising "at least once" delivery, then the consumers must take steps to recognize duplicate messages as they arrive.
If UserCreated comes after ProductAddedToCart the normal flow requires to throw an exception because the user doesn't exist yet.
Review Race Conditions Don't Exist, by Udi Dahan: "A microsecond difference in timing shouldn’t make a difference to core business behaviors."
The basic issue is assuming we can get messages IN ORDER...
This is a fallacy in distributed computing...
I suggest you design for no message ordering in your system.
As for your issues, try to use a UTC timestamp in the message body/header, created by the originator, and work around that data point. Sequence numbers are going to fail unless you have a central deterministic sequence creator (which will be a non-scalable single point of failure).
Using sagas/state machines is a path that can help make sense of (business) event ordering.

How to handle side effects based on multiple events in a message driven microservice system?

We are currently working in a message-driven microservice environment and some of our messages/events are event sourced (using Apache Kafka). Now we are struggling with implementing more complex business requirements, where we have to take multiple events into account to create new events and side effects.
In the current situation we are working with devices that can produce errors, and we already process them and have a single topic which contains ERROR_OCCURRED and ERROR_RESOLVED events (so they are in order). We also make sure that all messages regarding a specific device always go to the same partition, and both messages share an ID that identifies that specific error incident. We already have a projection that consumes those events and provides an API for our customers, so that they can see all errors that have occurred and their current state.
Now we have to deal with the following requirement:
Reporting Errors
We need a push system that reports errors of devices to our external partners, but only after 15 minutes and if they have not been resolved in that timeframe. Our first approach was to consume all ERROR_RESOLVED events, store the IDs and have another consumer that is handling the ERROR_OCCURRED events in a delayed fashion (e.g. by only consuming the next ERROR_OCCURRED event on the topic if its timestamp is at least 15 minutes old). We would then be able to know if that particular error has already been resolved and does not need to be reported (since they share a common ID with the corresponding ERROR_RESOLVED event). Otherwise we send an HTTP request to our external partner and create an ERROR_REPORTED event on a new topic. Is there any better approach for delayed and conditional message processing?
We also have to take the following special use cases into account:
Service restarts: currently we are planning to keep the list of resolved errors in memory, so if a service restarts, that list has to be rebuilt from scratch. We could just replay the ERROR_RESOLVED messages, but that may take some time, and in that time no ERROR_OCCURRED events should be processed, because that may result in reporting errors that were resolved in less than 15 minutes without us being aware of it. Are there any good practices regarding replay vs. "normal" processing?
Scaling: we may increase or decrease the number of instances of our service at any time, so the partition assignment may change at runtime. That should not be a problem if we create a consumer group for each service instance when consuming the ERROR_RESOLVED events, so that every instance knows all resolved errors while still only handling the ERROR_OCCURRED events of its assigned partitions (in another consumer group which is shared by all instances). Is there a better approach for handling partition reassignment and internal state?
Thanks in advance!
For side effects, I would record all "side" actions in the event store. In your particular example, when it is time to send a notification, I would issue a SEND_NOTIFICATION command that emits a NOTIFICATION_SENT event. These events would be processed by some worker process that performs the actual HTTP request.
Actually, I would elaborate on this even further: since notifications could fail, I would have, say, two events, NOTIFICATION_REQUIRED and NOTIFICATION_SENT, so we can retry failed notifications.
And finally, your logic would be: "if the error was not resolved within 15 minutes and a notification was not sent, send a notification (or just discard it if it missed its timeframe)".
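A minimal sketch of that rule, assuming an in-memory view of error incidents and invented event names (the Kafka consumer/producer plumbing is omitted): an incident that is still unresolved after 15 minutes gets a NOTIFICATION_REQUIRED event exactly once, and a separate worker performs the HTTP call and emits NOTIFICATION_SENT so failed calls can be retried.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: decide, per error incident, whether a notification is still required.
// NOTIFICATION_REQUIRED is emitted once; a separate worker performs the HTTP call
// and emits NOTIFICATION_SENT, so failed calls can be retried.
class ErrorReportingPolicy {
    record Incident(Instant occurredAt, boolean resolved, boolean notificationRequired, boolean notificationSent) {}

    private static final Duration GRACE_PERIOD = Duration.ofMinutes(15);
    private final Map<String, Incident> incidents = new ConcurrentHashMap<>();

    void onErrorOccurred(String errorId, Instant occurredAt) {
        incidents.putIfAbsent(errorId, new Incident(occurredAt, false, false, false));
    }

    void onErrorResolved(String errorId) {
        incidents.computeIfPresent(errorId,
                (id, i) -> new Incident(i.occurredAt(), true, i.notificationRequired(), i.notificationSent()));
    }

    // Called when we re-evaluate an incident (e.g. from a scheduled check or a delayed consumer).
    void evaluate(String errorId, Instant now, EventPublisher publisher) {
        Incident i = incidents.get(errorId);
        if (i == null || i.resolved() || i.notificationRequired()) {
            return; // unknown, resolved in time, or already queued for notification
        }
        if (Duration.between(i.occurredAt(), now).compareTo(GRACE_PERIOD) >= 0) {
            publisher.publish("NOTIFICATION_REQUIRED", errorId);
            incidents.put(errorId, new Incident(i.occurredAt(), false, true, false));
        }
    }

    interface EventPublisher { void publish(String eventType, String errorId); }
}
```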

Listening on multiple events

How do you deal with correlated events in an event-driven architecture? Concretely, what if multiple events must be triggered in order for some action to be performed? For example, I have a microservice that listens to two events, foo and bar, and only performs an action when both of the events arrive and have the same correlation id.
One way would be to keep an internal data structure inside the microservice that does the bookkeeping, and when everything is satisfied the appropriate action is triggered. However, the problem with this approach is that the microservice is not immutable anymore.
Is there a better approach?
A classic example is where an order comes in at sales and an event is published. Both Finance and Shipping are subscribed to the event, but shipping is also subscribed to the event coming from finance.
The funny thing is that you have no idea on the order in which the messages arrive. The event from sales might cause a technical error, because the database is offline. It might get queued again or end up in an error queue for operations to retry it. In the meantime the event from finance might arrive. So theoretically
the event from sales should arrive first and then the finance event, but in practice it can be the other way around.
There are a number of solutions here, but I've never liked the graphical ones. As a .NET developer I've used K2 and Windows Workflow Foundation in the past, but the solutions most flexible are created in code, not via a graphical interface.
I currently would use NServiceBus or MassTransit for this. On a sidenote, I currently work at Particular Software and we make NServiceBus. NServiceBus has Sagas for this kind of work (documentation) and you can also read on my weblog about a presentation, incl. code on GitHub.
The term saga is kind of loaded, but it basically handles long running (business) processes. Gregor Hohpe calls it a Process Manager (link).
To summarize what sagas do: they are instantiated by incoming messages and have state. Incoming messages are bound/dispatched to a specific saga instance based on a correlation id, for example a customer id or order id. Once the message (event) is processed, the state is stored until a new message arrives, or until the code marks the saga as completed and the state is removed from storage.
As said, in the .NET world MassTransit and NServiceBus support this, but there are most likely alternatives in other environments.
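To make that summary concrete, here is a small illustrative sketch in plain Java, not tied to NServiceBus or MassTransit (the names are invented): state is looked up by correlation id, each incoming event is recorded, and the action fires once both events have been seen.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a process manager: state is looked up (or created) by correlation id,
// each incoming event is recorded, and the action fires once both events are present.
class FooBarProcessManager {
    enum EventType { FOO, BAR }

    private final Map<UUID, Set<EventType>> state = new ConcurrentHashMap<>();

    void handle(UUID correlationId, EventType event) {
        Set<EventType> seen = state.computeIfAbsent(correlationId, id -> EnumSet.noneOf(EventType.class));
        synchronized (seen) {
            seen.add(event);
            if (seen.containsAll(EnumSet.allOf(EventType.class))) {
                performAction(correlationId);   // both foo and bar have arrived
                state.remove(correlationId);    // saga completed, state removed from storage
            }
        }
    }

    private void performAction(UUID correlationId) {
        System.out.println("Both events received for " + correlationId);
    }
}
```

In a real saga framework the per-correlation state would be kept in persistent storage, exactly as described above, so the process survives restarts.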
If I understand correctly, it looks like you need a CEP (complex event processor), like WSO2 CEP or others, which does exactly that.
CEPs can aggregate events and perform actions when certain conditions have been met.

Could an event store become a single point of failure?

For a couple of days I've been trying to figure out how to inform the rest of the microservices that a new entity was created in microservice A, which stores that entity in MongoDB.
I want to:
Have low coupling between the microservices
Avoid distributed transactions between microservices like Two Phase Commit (2PC)
At first, a message broker like RabbitMQ seems to be a good tool for the job, but then I see the problem that committing the new document in MongoDB and publishing the message to the broker is not atomic.
Why event sourcing? by eventuate.io:
One way of solving this issue involves making the schema of the documents a bit dirtier by adding a flag that says whether the document has been published to the broker, and having a scheduled background process that searches for unpublished documents in MongoDB and publishes them to the broker using confirmations; when the confirmation arrives, the document is marked as published (using at-least-once delivery and idempotency semantics). This solution is proposed in this and this answers.
Reading Introduction to Microservices by Chris Richardson, I ended up at this great presentation, Developing functional domain models with event sourcing, where one of the slides asked:
How to atomically update the database and publish events without 2PC? (the dual-write problem).
The answer is simple (on the next slide):
Update the database and publish events
This is a different approach from this one, which is based on CQRS à la Greg Young.
The domain repository is responsible for publishing the events; this would normally be inside a single transaction together with storing the events in the event store.
I think that delegating the responsibility of storing and publishing the events to the event store is a good thing, because it avoids the need for 2PC or a background process.
However, in a certain way it's true that:
If you rely on the event store to publish the events you'd have a tight coupling to the storage mechanism.
But we could say the same if we adopt a message broker for intercommunication between the microservices.
The thing that worries me more is that the Event Store seems to become a Single Point of Failure.
If we look at this example from eventuate.io,
we can see that if the event store is down, we can't create accounts or money transfers, losing one of the advantages of microservices (although the system will continue responding to queries).
So, is it correct to say that the event store, as used in the eventuate example, is a single point of failure?
What you are facing is an instance of the Two Generals' Problem. Basically, you want two entities on a network to agree on something, but the network is not fail-safe. Leslie Lamport proved that this is impossible.
So no matter how much you add new entities to your network, the message queue being one, you will never have 100% certainty that agreement will be reached. In fact, the opposite takes place: the more entities you add to your distributed system, the less you can be certain that an agreement will eventually be reached.
A practical answer to your case is that 2PC is not that bad if the alternative is adding even more complexity and more single points of failure. If you absolutely do not want a single point of failure and want to assume that the network is reliable (in other words, that the network itself cannot be a single point of failure), you can try a P2P algorithm such as a DHT, but for two peers I bet it reduces to simple 2PC.
We handle this with the Outbox approach in NServiceBus:
http://docs.particular.net/nservicebus/outbox/
This approach requires that the initial trigger for the whole operation comes in as a message on the queue, but it works very well.
You could also create a flag for each entry inside the event store which tells whether the event has already been published. Another process could poll the event store for those unpublished events and put them onto a message queue or topic. The disadvantage of this approach is that consumers of this queue or topic must be designed to de-duplicate incoming messages, because this pattern only guarantees at-least-once delivery. Another disadvantage could be latency because of the polling frequency, but since we have already entered eventually consistent territory here, this might not be such a big concern.
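A hedged sketch of that polling approach; the store and broker interfaces are placeholders rather than a specific driver or client API.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: periodically scan for entries flagged as unpublished, push them to the broker,
// and flip the flag only after the broker confirms. Delivery is at-least-once, so the
// consumers on the queue/topic must de-duplicate.
class UnpublishedEventPoller {
    interface EventStore {
        List<StoredEvent> findUnpublished(int limit);
        void markPublished(String eventId);
    }
    interface Broker {
        void publishWithConfirmation(StoredEvent event); // blocks until the broker acks
    }
    record StoredEvent(String eventId, String payload) {}

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(EventStore store, Broker broker) {
        scheduler.scheduleWithFixedDelay(() -> {
            for (StoredEvent event : store.findUnpublished(100)) {
                broker.publishWithConfirmation(event);   // at-least-once: may redeliver on crash
                store.markPublished(event.eventId());    // flip the "published" flag afterwards
            }
        }, 0, 1, TimeUnit.SECONDS);                      // polling interval adds latency
    }
}
```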
What if we have two event stores, and whenever a domain event is created it is queued onto both of them, and the event handler on the query side handles events popped from both event stores?
Of course, every event should be idempotent.
But wouldn't this solve our problem of the event store being a single point of failure?
Not particularly a MongoDB solution, but have you considered leveraging the Streams feature introduced in Redis 5 to implement a reliable event store? Take a look at this intro here.
I find that it has a rich set of features like message tailing and message acknowledgement, as well as the ability to extract unacknowledged messages easily. This surely helps to implement at-least-once messaging guarantees. It also supports load balancing of messages using the "consumer group" concept, which can help with scaling the processing part.
Regarding your concern about it being a single point of failure: as per the documentation, streams and consumer information can be replicated across nodes and persisted to disk (using regular Redis mechanisms, I believe). This helps address the single-point-of-failure issue. I'm currently considering using this for one of my microservices projects.

Approach for taking action on reception of two different JMS messages

Say I have one JMS message FooCompleted
{"businessId": 1,"timestamp": "20140101 01:01:01.000"}
and another JMS message BazCompleted
{"businessId": 1,"timestamp": "20140101 01:02:02.000"}
The use case is that I want some action triggered when both messages have been received for the business id in question - essentially a join on the reception of the two messages. The two messages are published on two different queues, and the order of reception of FooCompleted and BazCompleted may vary. In reality, I may need to join the reception of several different messages for the businessId in question.
The naive approach is to store the reception of each message in a DB, check whether the message(s) for its dependent join arm(s) have been received, and only then kick off the desired action. Given that the problem seems generic enough, we were wondering if there is a better way to solve this.
Another thought was to move messages from these two queues into a third queue on reception. The listener on this third queue would use a special variant of DefaultMessageListenerContainer which overrides doReceiveAndExecute to call receiveMessage for all outstanding messages in the queue, adding back to the queue any messages whose dependent messages have not all arrived yet; the remaining ones would be acknowledged and hence removed. Given that the volume of messages will be low, repeatedly probing the queue and re-adding messages should not be a problem. The advantage would be avoiding the DB dependency and the associated scaffolding code. I wanted to see if there is something glaringly bad with this.
Gurus, please critique and point out better ways to achieve this.
Thanks in advance!
Spring Integration with a JMS message-driven adapter and an aggregator with custom correlation and release strategies, and a persistent (JDBC) message store, will provide your first solution without writing much (or any) code.
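For reference, a rough sketch of what that could look like with Spring Integration's Java DSL. The connection factory, data source, queue names, and the assumption that businessId has been promoted to a message header are mine, and method names may differ between versions.

```java
import javax.jms.ConnectionFactory;
import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;
import org.springframework.integration.jdbc.store.JdbcMessageStore;
import org.springframework.integration.jms.dsl.Jms;

// Rough sketch: both queues feed one channel; an aggregator groups messages by businessId,
// releases the group once both FooCompleted and BazCompleted have arrived, and keeps
// partial groups in a JDBC-backed message store so nothing is lost on restart.
@Configuration
public class JoinFlowConfig {

    @Bean
    public JdbcMessageStore messageStore(DataSource dataSource) {
        return new JdbcMessageStore(dataSource);
    }

    @Bean
    public IntegrationFlow fooFlow(ConnectionFactory connectionFactory) {
        return IntegrationFlows
                .from(Jms.messageDrivenChannelAdapter(connectionFactory).destination("fooCompletedQueue"))
                .channel("joinChannel")
                .get();
    }

    @Bean
    public IntegrationFlow bazFlow(ConnectionFactory connectionFactory) {
        return IntegrationFlows
                .from(Jms.messageDrivenChannelAdapter(connectionFactory).destination("bazCompletedQueue"))
                .channel("joinChannel")
                .get();
    }

    @Bean
    public IntegrationFlow joinFlow(JdbcMessageStore messageStore) {
        return IntegrationFlows.from("joinChannel")
                .aggregate(a -> a
                        .correlationExpression("headers['businessId']")  // custom correlation strategy
                        .releaseStrategy(group -> group.size() == 2)     // both messages have arrived
                        .messageStore(messageStore)                      // persistent group storage
                        .expireGroupsUponCompletion(true))
                .handle(m -> System.out.println("Join complete for businessId "
                        + m.getHeaders().get("businessId")))
                .get();
    }
}
```

If the businessId only lives in the JSON payload, the correlation expression would instead evaluate against the payload (or a header-enricher step would promote it to a header first).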

Resources