Looking For A Scalable PubSub Solution Or Alternative - performance

I'm currently looking for the best architecture for an IM app I'm trying to build.
The app consists of channels, each with a couple of thousand subscribed users. Each user is subscribed to only one channel at a time and can publish to and read from that channel. Users may move rapidly between channels.
I initially considered using XMPP PubSub (via ejabberd or MongooseIM), but as far as I understand it was added as an afterthought and is not very scalable.
I also thought about using a message queue protocol like AMQP, but I'm not sure that's what I'm looking for from the IM perspective.
Is my concern regarding the XMPP PubSub justified? And if so, do you know of a better solution?

Take a look at Redis and Kafka. Both are scalable and performant.

Based on your description, I imagine the following primary use cases for the IM application.
**Use cases**
- 1) Many new users keep registering with the system and subscribing to one of the channels
- 2) Many existing users keep changing their subscription from one channel to another
- 3) Many existing users keep publishing messages to channels
- 4) Many existing users keep receiving messages as subscribers
XMPP is a natural fit for the 3rd and 4th use cases. ejabberd is a proven, highly scalable platform to go with.
For the 2nd use case, you will probably have logic something like this:
- a) update the channel info of the user in the DB
- b) make him listen to the new channel
- c) change his publishing topic to the other channel, and so on
Whenever you need to perform multiple operations like these, I strongly recommend using Kafka to perform them asynchronously.
For the 1st use case, provide registration through REST APIs so that registration can be done from any device. While registering a user, you may have several operations, as follows:
- 1) register user in DB
- 2) create internally IM account
- 3) send email OR SMS for confirmation...so on
Here too, perform the 1st operation as part of the REST API service logic, and perform the 2nd and 3rd operations asynchronously using Kafka. That is, your service logic performs the 1st operation synchronously and raises an event to Kafka; consumers then handle the 2nd and 3rd operations asynchronously.
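To make the sync/async split concrete, here is a minimal sketch. It is only an illustration: an in-process `queue.Queue` stands in for the Kafka topic, and `users_db`, `register_user` and `consumer_loop` are hypothetical names, not part of any real API.

```python
import queue
import threading

# Stand-in for a Kafka topic; a real deployment would use a Kafka producer/consumer.
events = queue.Queue()
users_db = {}        # hypothetical user table
side_effects = []    # records what the async consumer did


def register_user(user_id, email):
    """Synchronous part of the REST handler: write to the DB, then raise an event."""
    users_db[user_id] = {"email": email}
    events.put({"type": "user_registered", "user_id": user_id, "email": email})
    return {"status": "created", "user_id": user_id}


def consumer_loop():
    """Asynchronous part: create the IM account and send the confirmation."""
    while True:
        event = events.get()
        if event is None:  # sentinel used to stop the worker
            events.task_done()
            break
        side_effects.append(("create_im_account", event["user_id"]))
        side_effects.append(("send_confirmation", event["email"]))
        events.task_done()


def run_demo():
    worker = threading.Thread(target=consumer_loop, daemon=True)
    worker.start()
    response = register_user("u1", "u1@example.com")  # returns without waiting for the consumer
    events.put(None)
    events.join()   # wait until the consumer has drained the queue
    worker.join()
    return response
```

The REST response does not depend on the slower steps, so a slow mail gateway would not slow down registration.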
A system scales well only if all of its layers/subsystems scale well. From that perspective, the tech stack below may help you scale:
REST APIs + Kafka + ejabberd (XMPP)

Related

Message validation for async messaging systems

I'm looking for the best approach to validating a message as it's enqueued in async messaging based systems.
Scenario:
Let's say we have two services, A and B, that need to interact with each other asynchronously. Between them we have a queue, let's say SQS, which receives messages from A and is then polled by service B.
Ask:
How can I validate the message (for example, schema validation) as it's enqueued to SQS, since SQS currently doesn't have any built-in schema validation functionality like we have for JMS?
A couple of options I can think of:
- 1) Have a validation layer, maybe a small service sitting between A and the SQS queue, but I'm not sure how feasible this will be
- 2) Use some sort of MOM like AWS EventBridge between A and the SQS queue, as it has functionality to validate schemas and could also act as a central location to store all the schemas
- 3) Have a REST endpoint in B that does the validation and have SQS sitting behind B, but then this removes the async communication between A and B
Would appreciate any inputs on the above ask and how it could be resolved via best practices.
I'd recommend reading about the Mediator Topology of the Event-Driven architecture style. From the details you shared, it sounds to me that putting in a "Mediator Service" (call it M) that gets messages from A, performs the required validations, and then sends the message to SQS on its way to B will achieve what you want.
Validation of the message payloads can occur on the "way in" or the "way out" depending on your use case and scaling needs. Most scenarios will aim to prevent invalid data getting too far downstream i.e. you will validate before putting data into SQS.
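As a rough sketch of validating on the "way in", the producer (or a mediator in front of the queue) can refuse to enqueue anything that fails a schema check. Everything here is an assumption for illustration: `ORDER_SCHEMA`, `validate` and `enqueue_validated` are made-up names, and a plain `queue.Queue` stands in for SQS.

```python
import json
import queue

# Hypothetical schema: field name -> required Python type.
ORDER_SCHEMA = {"order_id": str, "amount": float, "currency": str}


class ValidationError(Exception):
    pass


def validate(payload, schema):
    """Raise ValidationError if a field is missing or has the wrong type."""
    for field, expected in schema.items():
        if field not in payload:
            raise ValidationError(f"missing field: {field}")
        if not isinstance(payload[field], expected):
            raise ValidationError(f"wrong type for {field}")


def enqueue_validated(q, payload, schema=ORDER_SCHEMA):
    """Validate first, then enqueue; invalid messages never reach the queue."""
    validate(payload, schema)
    q.put(json.dumps(payload))
```

Calling `enqueue_validated(q, {"order_id": "o2", "amount": "9.5", "currency": "EUR"})` raises `ValidationError` because `amount` is a string, and the queue is left unchanged.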
However, there are reasons you may choose to validate the message payload while reading from the queue. For example, you may have many services adding messages, those messages may have multiple "payload versions" over time, different teams could be building services (frontend and backend) etc. Don't assume everything and everyone is consistent.
Assuming that the payload data in SQS is validated and can be processed by a downstream consumer without checking could cause lots of problems and/or breaking scenarios. Always check your data in these scenarios. In my experience it's either the number one reason, or close to it, for why breaking changes occur.
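On the consuming side, a defensive reader might look something like the sketch below. The `version` field, the `REQUIRED_FIELDS` table and `parse_message` are assumptions made up for this example; the point is only that the consumer checks every message instead of trusting the queue.

```python
import json

# Hypothetical required fields per payload version.
REQUIRED_FIELDS = {
    1: {"order_id", "amount"},
    2: {"order_id", "amount", "currency"},
}


def parse_message(raw):
    """Return (payload, None) if valid, or (None, reason) if it should be rejected."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None, "not valid JSON"
    if not isinstance(payload, dict):
        return None, "payload is not an object"
    version = payload.get("version")
    if not isinstance(version, int) or version not in REQUIRED_FIELDS:
        return None, f"unsupported version: {version!r}"
    missing = REQUIRED_FIELDS[version] - payload.keys()
    if missing:
        return None, "missing fields: " + ", ".join(sorted(missing))
    return payload, None
```

Rejected messages would typically go to a dead-letter queue together with the reason, rather than being dropped silently.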
Final point: with event-driven architectures the design decision points are not just about the processing/compute software services but also about the event data payloads themselves which also have to be designed properly.

What would be the right ZMQ Pattern?

I am trying to build a ZeroMQ pattern where,
- 1) There can be many clients connecting to a single server endpoint
- 2) Server will distribute incoming client tasks to available workers (will be mapped to the number of cores on the server)
- 3) These tasks are long running (in hours) and need to perform a lot of local I/O
- 4) During each task execution (iteration) there will be data/messages (potentially in order of [GB]s) sent back and forth between the client and the server worker
- 5) Client and server workers need to know if there are failures/errors on the peer side, so that they can recover (retry) or shut down gracefully and try later
Based on the above, I presume that the ROUTER/DEALER pattern would be useful. PUB/SUB is discarded as I need to know if the peer fails.
I tried using various combinations of the ROUTER/DEALER pattern, but I am unable to ensure that multiple messages from a client reach the same worker within an iteration. I understand that I need to implement a broker/forwarder/device that routes the incoming messages to the right recipient/handler/worker, but I am unable to map the frontend and backend sockets in the broker. I am looking at the MajorDomo pattern, but I guess there has to be a simpler broker model that could just route the messages to the assigned worker (without really getting into services).
I am looking for some examples, if there are any or any guidance on what I may be missing. I am trying to build this in Golang.
Q : "What would be the right ZMQ Pattern?"
Given the combined requirements in items 1 - 5, I dare say the right choice is NOT to use any single one of the standard, built-in ZeroMQ Communication Archetype Patterns, but rather to compose a multi-layered, application-specific Signalling-Messaging infrastructure (perhaps M + N + 1 hot-standby, as robust and self-resilient as you need) that covers all your current application-level requirements and can be extended for future ones, like the one depicted here for a much simpler distributed-computing use case, where only a trivial remote SigKILL was implemented.
Yes, the best approach is to create (and maintain) your own formalised signalling that the application level can handle and interact across: for example, heartbeating to detect dead workers, which permits re-instating failed jobs as soon as a failure is detected. Such jobs will most probably be relocated and/or rescheduled, with resources not statically pre-mapped but chosen where physically most feasible at the moment of re-instating, so additional telemetry signalling will help you decide where to re-instate the failed micro-jobs.
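The heartbeating idea can be sketched independently of the transport. The class below is a hypothetical bookkeeping helper (`HeartbeatMonitor` and its methods are made-up names, and no ZeroMQ calls are involved): the broker would call `beat()` whenever a heartbeat frame arrives from a worker and call `reap()` periodically to collect jobs that need rescheduling.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and reports jobs of silent workers."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a worker counts as dead
        self.clock = clock       # injectable so the logic can be tested without sleeping
        self.last_seen = {}      # worker_id -> time of last heartbeat
        self.jobs = {}           # worker_id -> job currently assigned

    def assign(self, worker_id, job):
        self.jobs[worker_id] = job
        self.last_seen[worker_id] = self.clock()

    def beat(self, worker_id):
        self.last_seen[worker_id] = self.clock()

    def reap(self):
        """Return jobs of workers that exceeded the timeout and forget those workers."""
        now = self.clock()
        dead = [w for w, seen in self.last_seen.items() if now - seen > self.timeout]
        orphaned = []
        for worker_id in dead:
            del self.last_seen[worker_id]
            job = self.jobs.pop(worker_id, None)
            if job is not None:
                orphaned.append(job)
        return orphaned
```

The same table that maps a worker to its job is also a natural place to keep the client-to-worker assignment, so that every message of an iteration is routed to the same worker.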
ZeroMQ is a fabulous framework for exactly such complex signalling and messaging hierarchies, so your System Architect's imagination is the only ceiling in this concept.
ZeroMQ will take care of the rest and do all the hard work nicely and easily.

How to get data into a ZMQ_PUB service?

Can a publisher service receive data from an external source and send it to the subscribers?
In the wuserver.cpp example, the data are generated within the same script.
Can I write a ZMQ_PUBLISHER entity, which receives data from external data source / application ... ?
Regarding this statement:
There is one more important thing to know about PUB-SUB sockets: you do not know precisely when a subscriber starts to get messages. Even if you start a subscriber, wait a while, and then start the publisher, the subscriber will always miss the first messages that the publisher sends. This is because as the subscriber connects to the publisher (something that takes a small but non-zero time), the publisher may already be sending messages out.
Does this mean, that a PUB-SUB ZeroMQ pattern is performed to a best effort - UDP style?
Q1: Can I write a ZMQ_PUBLISHER entity, which receives data from external data source/application?
A1: Oh sure, this is exactly where ZeroMQ helps us design smart distributed systems. Just imagine the PUB-side process also making other { .bind() | .connect() } calls to establish links to the data feeder(s), and you have the scheme you wanted. In distributed systems this gives you new freedom to integrate heterogeneous systems so that they talk to each other very efficiently.
Q2: Does this mean, that a PUB-SUB ZeroMQ pattern is performed to a best effort - UDP style?
A2: No, it has another meaning. Newly declared subscriber entities start, at some uncertain moment, to negotiate their respective subscription-topic filtering, and this (distributed) process takes an a priori unknown amount of time. Until the new or changed topic-filter policy has been established, nothing reaches the SUB-side egress interface to satisfy a .recv() call, so no one can tell when that will happen.
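A toy model may help to picture this. The class below is not ZeroMQ code and does not model the network; `ToyPublisher` is a made-up name. It only mimics two properties: subscriptions are prefix filters, and a message sent before a subscription is established is simply not delivered to that subscriber (it is not queued for later).

```python
class ToyPublisher:
    """Toy model only: delivers a message to subscriptions established at send time."""

    def __init__(self):
        self.subscriptions = []  # (prefix, inbox) pairs that are already established

    def subscribe(self, prefix, inbox):
        # In real life this step completes at some unknown later moment.
        self.subscriptions.append((prefix, inbox))

    def send(self, message):
        for prefix, inbox in self.subscriptions:
            if message.startswith(prefix):
                inbox.append(message)


pub = ToyPublisher()
inbox = []
pub.send("weather 10001 21C")      # sent before the subscription exists: missed
pub.subscribe("weather", inbox)
pub.send("weather 10001 22C")      # matches the established filter: delivered
pub.send("sports final 2-1")       # does not match the filter: not delivered
```

After this runs, `inbox` holds only the second weather message, delivered whole.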
On a higher level, there is another well-known dichotomy in ZeroMQ -- the Zero-Warranty Principle -- expect to get delivered either a complete message or none at all, which spares the framework users from having to handle any kind of damaged / inconsistent message payloads. Either OK, or None. That's a great warranty, all the more so for distributed systems.

Remote persistent views with Lagom

In a classical microservice architecture, you have relevant domain events published on some messaging system which allows other parts of the system to react.
Now imagine you have three microservices: Customers, Orders and Recommendation. The Recommendation microservice needs information from Customers and Orders to provide its functionality, such as the list of all customers and all the orders, which is going to be analyzed by some machine learning algorithm. Now you need to have the state of Customers "join" Orders on the Recommendation microservice. We see three options:
- 1) You have the Recommendation microservice listen to domain events published by Customers and Orders and build its own state. This leads to logic duplication, since you probably have that same logic inside Customers and Orders already
- 2) On each relevant domain message from Customers and Orders, you just go to them and ask for the state of a specific customer or order. This works fine; however, if you have N services rather than just one that need to build a materialized view, you will cause a big load on Customers and Orders
- 3) You have Customers and Orders themselves publish "heavy-weight" events (not domain events) that allow any other microservice to build a materialized view without processing domain events. This allows you both a) not to duplicate the logic and b) not to keep asking for the same information
Does pattern n.3 have some drawbacks we couldn't figure out and, if not, how do you implement it in Lagom?
I will try to explain a few more bits in the hope to give you some more perspective on that matter and how you can achieve it in a reliable way in Lagom.
We have a few concepts that we must keep in mind. The most important one which is the source of all is Event Sourcing itself. Event Sourcing means that any State in the system has its source in Events.
The first State that we will deal with is the State of the PersistentEntity. This State is prominent because, together with the Command and Event Handler, it defines the consistency boundary of your model.
But there are other States in the system. Actually, we can create as many as we want because we have the Event Journal. A read-model is also a State and it’s also generated from the events.
There are many reasons why you shouldn’t publish the State of the PersistentEntity to other systems. The first one being a matter of avoiding coupling. You don’t want your data to leak to other services. That’s all about having an anti-corruption layer (ACL).
So, from here we could say: before publishing Order and Customer to Recommendation Service, I will transform it to OrderView and CustomerView (ACL 101).
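As a rough illustration of that transformation (not Lagom code; the field names below are invented for the example), the internal model keeps details that the published view deliberately leaves out:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical internal model of the Orders service."""
    order_id: str
    customer_id: str
    total_cents: int
    internal_risk_score: float  # internal detail that must not leak
    warehouse_note: str         # internal detail that must not leak


@dataclass(frozen=True)
class OrderView:
    """Hypothetical external contract published to other services."""
    order_id: str
    customer_id: str
    total: str


def to_order_view(order):
    """Anti-corruption layer: map the internal model to the external format."""
    total = f"{order.total_cents // 100}.{order.total_cents % 100:02d}"
    return OrderView(order.order_id, order.customer_id, total)
```

Because consumers only ever see `OrderView`, the Orders service can rename or drop `internal_risk_score` without breaking anyone.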
The question now is when will you do it? If you try to publish it in Kafka after you have handled a command, you don’t have any guarantee that the State will be published. There are no XA transactions between the event journal and the Kafka topic. So, there is a chance that the events are persisted, but for some reason, the State is not published in Kafka.
If you want data to get out of a service in a reliable way and without creating coupling between services, you have the following options:
Use the broker API and publish the events to a topic. You should not publish the events as they are, but transform them into the format of your external API (ACL).
Use a read-side processor to generate a view of it, again in the external API format you want to make available. If you want, you can publish that ViewState to a topic so other services can consume it directly.
That said, there is nothing wrong in publishing something in a topic that is not a real event, but some derived State. The problem is how you can guarantee that it is effectively published. Doing that from inside the PersistentEntity is risky because you have at-most-once semantics. The most reliable way of doing it is a read-side process that gives you at-least-once semantics.
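The at-least-once behaviour comes from tracking an offset and only advancing it after a successful publish. The sketch below is a plain-Python illustration of that idea under simplifying assumptions (an in-memory journal and offset, hypothetical names `ReadSideProcessor` and `demo`); it is not the Lagom read-side API.

```python
class ReadSideProcessor:
    """Publishes journal events in order and remembers how far it got."""

    def __init__(self, journal, publish):
        self.journal = journal  # list of events; the list index is the offset
        self.publish = publish  # callable that may raise
        self.offset = 0         # would be stored durably in a real system

    def run_once(self):
        while self.offset < len(self.journal):
            self.publish(self.journal[self.offset])  # may raise
            self.offset += 1  # advance only after a successful publish


def demo():
    topic = []
    state = {"fail_next_ack": True}

    def publish(event):
        topic.append(event)  # the broker has accepted the message...
        if event == "e2" and state["fail_next_ack"]:
            state["fail_next_ack"] = False
            raise ConnectionError("ack lost")  # ...but the processor never learns that

    processor = ReadSideProcessor(["e1", "e2", "e3"], publish)
    try:
        processor.run_once()
    except ConnectionError:
        pass  # simulated crash; the offset still points at e2
    processor.run_once()  # the restart resumes from the stored offset
    return topic
```

`demo()` returns `["e1", "e2", "e2", "e3"]`: nothing is lost, but `e2` is delivered twice, which is why consumers need to be idempotent.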
Further comments inline...
Listen to domain events from customer and orders and rebuild the state
in the recommendation service. This is a horrible idea because you
would need to duplicate the logic that handles events across different
bounded contexts
That's not a horrible idea. That's how you make your services independent from each other. The logic that you will need to implement to consume the events is not the same. As you said, it's a different bounded context, and as such it only gets what it needs.
Leaking the State from a BC to another is more problematic for the reasons I mentioned above (anti-corruption layer).
To achieve decoupling you do need more coding and there is nothing wrong with that. At the end of the day, the reason for building microservices is to avoid coupling and be able to let the services evolve and scale without interfering with each other. There is a price to pay for that and the price is to write more code. You need to evaluate the trade-offs.
You can consume your own events, produce an OrderView and CustomerView and publish into Kafka, but that's the same as consuming the events directly on the Recommendation Service.
Note that you also need to store OrderView and CustomerView somewhere in the Recommendation Service. So you end up storing it three times: in the original service (view table), in Kafka and in the Recommendation Service.
That's why publishing events in a topic is the best option to propagate data between services.
Every time we receive a domain event from customers or orders, go to
them and ask them the state. This is horrible because if you have more
than one microservice that needs their state, you will end up
producing load on customers and orders
That is indeed a horrible idea because you will make the Recommendation Service dependent on the other two services. If Order or Customer is down, the Recommendation will be down as well. That's what a broker helps to solve.
Have customers and orders not only publish events but also state and
have all the services that need to build materialized views listen
to the state they need. How do you apply the last pattern with Lagom? We
found no way to listen to state changes, just to events. One solution
we considered implied publishing with pubSub the state in the onEvent
handler of a persistent entity but I am not sure this is the right
place to make it happen.
Using pubSub in the onEvent handler is the worst solution of all. For the following reasons:
pubSub has at-most-once semantics (see comments above)
Event handlers are called many times. Whenever you re-hydrate an Entity, the events are replayed and the event handlers are used for that, which means that you will re-publish the state each time. Actually, you would solve the at-most-once pubSub problem, but not in the way you might expect or desire.
You could use the afterPersist callback for that, but that's not reliable either because pubSub is at-most-once.
PubSub inside a PersistentEntity should not be used for something that you need to be reliable. It's a best-effort capability, that's all.
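The replay problem is easy to reproduce outside of Lagom. In the sketch below (plain Python, with hypothetical names `CounterEntity` and `rehydrate`), a side effect placed inside the event handler fires again every time the entity is rebuilt from its journal:

```python
class CounterEntity:
    """Minimal event-sourced entity with a side effect inside its event handler."""

    def __init__(self, on_state_change=None):
        self.value = 0
        self.on_state_change = on_state_change

    def apply(self, event):
        """Event handler: also invoked for every event during replay."""
        if event["type"] == "incremented":
            self.value += event["by"]
        if self.on_state_change:
            self.on_state_change(self.value)  # side effect inside the handler


def rehydrate(journal, on_state_change=None):
    """Rebuild the entity by replaying all persisted events."""
    entity = CounterEntity(on_state_change)
    for event in journal:
        entity.apply(event)
    return entity


published = []
journal = [{"type": "incremented", "by": 1}, {"type": "incremented", "by": 2}]
rehydrate(journal, published.append)  # first load of the entity
rehydrate(journal, published.append)  # entity is passivated and loaded again
```

After the second load, `published` contains every intermediate state twice, even though no new event was persisted.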

An event store could become a single point of failure?

For a couple of days I've been trying to figure out how to inform the rest of the microservices that a new entity was created in microservice A, which stores that entity in MongoDB.
I want to:
Have low coupling between the microservices
Avoid distributed transactions between microservices like Two Phase Commit (2PC)
At first a message broker like RabbitMQ seems to be a good tool for the job, but then I see the problem that committing the new document in MongoDB and publishing the message to the broker are not atomic.
Why event sourcing? by eventuate.io:
One way of solving this issue is to make the schema of the documents a bit dirtier by adding a flag that says whether the document has been published to the broker, and to have a scheduled background process that searches for unpublished documents in MongoDB and publishes them to the broker using confirmations. When the confirmation arrives, the document is marked as published (using at-least-once and idempotency semantics). This solution is proposed in this and this answer.
Reading an Introduction to Microservices by Chris Richardson, I ended up at this great presentation, Developing functional domain models with event sourcing, where one of the slides asked:
How to atomically update the database and publish events without 2PC? (dual write problem).
The answer is simple (on the next slide)
Update the database and publish events
This is a different approach to this one that is based on CQRS a la Greg Young.
The domain repository is responsible for publishing the events, this
would normally be inside a single transaction together with storing
the events in the event store.
I think that delegating the responsibilities of storing and publishing the events to the event store is a good thing because it avoids the need for 2PC or a background process.
However, in a certain way it's true that:
If you rely on the event store to publish the events you'd have a
tight coupling to the storage mechanism.
But we could say the same if we adopt a message broker for communication between the microservices.
The thing that worries me more is that the Event Store seems to become a Single Point of Failure.
If we look at this example from eventuate.io
we can see that if the event store is down, we can't create accounts or money transfers, losing one of the advantages of microservices (although the system will continue responding to queries).
So, is it correct to say that the Event Store as used in the eventuate example is a Single Point of Failure?
What you are facing is an instance of the Two Generals' Problem. Basically, you want to have two entities on a network agreeing on something, but the network is not fail safe. This has been proven to be impossible.
So no matter how much you add new entities to your network, the message queue being one, you will never have 100% certainty that agreement will be reached. In fact, the opposite takes place: the more entities you add to your distributed system, the less you can be certain that an agreement will eventually be reached.
A practical answer to your case is that 2PC is not that bad compared with adding even more complexity and more single points of failure. If you absolutely do not want a single point of failure and want to assume that the network is reliable (in other words, that the network itself cannot be a single point of failure), you can try a P2P algorithm such as DHT, but for two peers I bet it reduces to simple 2PC.
We handle this with the Outbox approach in NServiceBus:
http://docs.particular.net/nservicebus/outbox/
This approach requires that the initial trigger for the whole operation came in as a message on the queue but works very well.
You could also create a flag for each entry inside of the event store which tells if this event was already published. Another process could poll the event store for those unpublished events and put them into a message queue or topic. The disadvantage of this approach is that consumers of this queue or topic must be designed to de-duplicate incoming messages because this pattern does only guarantee at-least-once delivery. Another disadvantage could be latency because of the polling frequency. But since we have already entered the eventually consistent area here this might not be such a big concern.
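Here is a minimal sketch of that flag-and-poll idea, with the de-duplication on the consumer side. It rests on several assumptions: SQLite (in memory) stands in for the real database, the table and function names are invented, and `send` is whatever callable hands the event to the broker.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, owner TEXT)")
    db.execute(
        "CREATE TABLE outbox (event_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT, published INTEGER DEFAULT 0)"
    )
    return db


def create_account(db, account_id, owner):
    """Store the entity and its event in one local transaction."""
    with db:  # commits on success, rolls back if an exception is raised
        db.execute("INSERT INTO accounts VALUES (?, ?)", (account_id, owner))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (f"account_created:{account_id}",),
        )


def publish_pending(db, send):
    """Poller: send unpublished events, then flag them as published.

    A crash between send() and the UPDATE causes a resend on the next poll,
    which is why delivery is at-least-once.
    """
    rows = db.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in rows:
        send(event_id, payload)
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)


class DedupingConsumer:
    """Ignores events whose id it has already handled."""

    def __init__(self):
        self.seen = set()
        self.handled = []

    def receive(self, event_id, payload):
        if event_id in self.seen:
            return
        self.seen.add(event_id)
        self.handled.append(payload)
```

A typical run would be `db = open_db()`, `create_account(db, "a1", "alice")`, then `publish_pending(db, consumer.receive)` on a timer; the polling interval is the latency mentioned above.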
How about having two event stores, where every Domain Event that is created is queued onto both of them, and the event handler on the query side handles events popped from both event stores?
Of course, handling of every event should be idempotent.
But wouldn’t this solve our problem of the event store being a single point of failure?
Not particularly a MongoDB solution, but have you considered leveraging the Streams feature introduced in Redis 5 to implement a reliable event store? Take a look at this intro here
I find that it has a rich set of features like message tailing, message acknowledgement, as well as the ability to extract unacknowledged messages easily. This surely helps to implement at-least-once messaging guarantees. It also supports load balancing of messages using the "consumer group" concept, which can help with scaling the processing part.
Regarding your concern about being the single point of failure, as per the documentation, streams and consumer information can be replicated across nodes and persisted to disk (using regular Redis mechanisms I believe). This helps address the single point of failure issue. I'm currently considering using this for one of my microservices projects.

Resources