I would like to operate a service that anticipates having subscribers who are interested in various kinds of products. A product is a bag of dozens of attributes:
{
"product_name": "...",
"product_category": "...",
"manufacturer_id": "...",
[...]
}
A subscriber can express an interest in any subset of these attributes. For example, this subscription:
{ [...]
"subscription": {
"manufacturer_id": 1234,
"product_category": 427
}
}
will receive events that match both product_category: 427 and manufacturer_id: 1234. Conversely, this event:
{ [...]
"event": {
"manufacturer_id": 1234,
"product_category": 427
}
}
will deliver messages to any subscribers who care about:
that manufacturer_id, or
that product_category, or
both that manufacturer_id and that product_category
It is vital that these notifications be delivered as expeditiously as possible, because the subscriptions may have a few hundred milliseconds, or a second at most, to take downstream actions. The cache lookup should therefore be fast.
Question: If one wants to cache subscriptions this way for highly efficient lookup on one or more filterable attributes, what sort of approaches or architectures would allow one to do this well?
The answer depends on some factors that you have not described in your scenario. For example, what is the extent of the data? How many products/categories/users are there, and what are the estimated data sizes: megabytes, gigabytes, terabytes? Also, what is the expected throughput of changes to products/subscriptions, and of events?
So my answer will be for a medium-sized scenario in the gigabytes range, where you can likely fit your subscription dataset into memory on a single machine.
In this case the straightforward approach would be to have your events appear on an event bus, for example implemented with Kafka or Pulsar. Then you would have a service that consumes the events as they come in and queries an in-memory data store for subscription matches. (The in-memory store has to be built/copied on startup and potentially kept up to date from a different event source.)
This in-memory store could be a document database like MongoDB, for example, which offers a pure in-memory mode that gives you more predictable performance. In order to ensure predictable, high-performance lookups within the database you need to specify your indexes correctly: any property that is relevant to the lookup needs to be indexed. Also consider that such stores can use compound indexes to speed up lookups on property combinations. Other in-memory key-value stores you may want to consider as alternatives are Redis or Memcached. If performance is a critical requirement, I would recommend running trials with different systems where you ingest your dataset, build the indexes, and try out the queries you need in order to compare lookup times.
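As a rough sketch of the set-based lookup with Redis (a sketch only, assuming the go-redis v9 client; key names and subscriber IDs are made up for the example, and a subscription that names several attributes would still need to be checked against the full event after this candidate lookup):

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"}) // assumed local Redis

	// Index each subscription under one key per "attribute=value" pair.
	// Subscriber "sub-a" cares about manufacturer_id=1234 and product_category=427,
	// subscriber "sub-b" only about product_category=427.
	rdb.SAdd(ctx, "sub:manufacturer_id=1234", "sub-a")
	rdb.SAdd(ctx, "sub:product_category=427", "sub-a", "sub-b")

	// An incoming event carrying manufacturer_id=1234 and product_category=427
	// matches the union of both sets (the "any attribute" candidate set).
	matches, err := rdb.SUnion(ctx, "sub:manufacturer_id=1234", "sub:product_category=427").Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(matches) // [sub-a sub-b] (order not guaranteed)
}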
So the service can now quickly determine the set of users to notify. From here you have two choices: you could have the same service send out notifications directly, or (what I would probably do) you could separate concerns and have a second service whose responsibility is performing the actual notifications. The communication between those services could again be via a topic on the event bus.
This kind of setup should easily handle thousands of events per second with single service instances. If the number of events grows to massive scale, you can run multiple instances of your services to improve throughput; for that you'd have to look into organizing consumer groups correctly for multiple consumers.
The technologies for implementing the services are probably not critical, but if I knew the system had strict performance requirements I would go with a language that allows manual memory management, for example Rust or C++. Other alternatives could be languages like Go or Java, but then you'd have to pay attention to how garbage collection behaves and make sure it doesn't interfere with your performance requirements.
In terms of infrastructure: for a medium or large system you would typically run your services in containers on a cluster of machines, for example using Kubernetes.
If it happens that your system scale is on the smaller side you may not need a distributed setup and instead can deploy the described components/services on a single machine.
With such a setup the expected round-trip time, from the moment an event comes in to the moment a notification goes out, should reliably be in the single-digit milliseconds for a local client.
The way I would do that is with a key/value table that holds an array of subscriber IDs per "attribute name = value" key, like this (where a, b, c, d, y, z are the subscriber IDs):
{ [...]
"manufacturer_id=1234": [a,b,c,d],
"product_category=427": [a,b,y,z],
[...]
}
In your example the event has "manufacturer_id" = 1234 and "product_category" = 427, so just look up the subscribers where the key is manufacturer_id=1234 or product_category=427 and you'll get the arrays of all the subscribers you want. Then just "merge distinct" those arrays and you'll have every subscriber ID you need.
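A minimal in-process sketch of that table and the "merge distinct" step (the keys and IDs mirror the example above; everything else is illustrative):

package main

import "fmt"

// index maps "attribute=value" keys to the subscriber IDs interested in that pair.
var index = map[string][]string{
	"manufacturer_id=1234": {"a", "b", "c", "d"},
	"product_category=427": {"a", "b", "y", "z"},
}

// matchSubscribers returns the distinct subscriber IDs for all attribute=value
// pairs carried by an event (the "merge distinct" of the arrays above).
func matchSubscribers(eventKeys []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, k := range eventKeys {
		for _, sub := range index[k] {
			if !seen[sub] {
				seen[sub] = true
				out = append(out, sub)
			}
		}
	}
	return out
}

func main() {
	event := []string{"manufacturer_id=1234", "product_category=427"}
	fmt.Println(matchSubscribers(event)) // [a b c d y z]
}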
Or, depending on how complex/smart the database you are using is, you can normalize it, like this:
{ [...]
"manufacturer_id": {
"1234": [a,b,c,d],
"5678": [e,f,g,h],
[...]
},
"product_category": {
"427": [a,b,g,h],
"555": [c],
[...]
},
[...]
}
I would propose sharding as an architecture pattern.
Every shard will listen for all events for all products from the source of the events.
For best latency I would propose two layers of sharding. The first layer is geographical (country or city, depending on customer distribution); it is connected to the source with a low-latency connection and sits in the same data center as the second-level shards for that location. The second level is sharded on userId; it needs to receive all product events, but it handles subscriptions only for its region.
The first layer is responsible for fanning the events out to the second layer based on the geographical position of the subscriptions. This is more or less a single microservice. It can be done with general event brokers, but considering it is going to be relatively simple, we can implement it in Go or C++ and optimize for latency.
For the second layer, every shard is responsible for a number of users from the location, and every shard receives all the events for all products. A shard is made up of one microservice for subscription caching and notification logic, plus one or more notification-delivery microservices.
The subscriptions microservice stores an in-memory cache of the subscriptions and checks every event for subscribed users using maps, i.e. it stores a map from product field to subscribed userIds, for example. For this microservice latency matters most, so a custom implementation in Go / C++ should deliver the best latency. The subscriptions microservice should not have its own DB or any external cache, as network latency is just a drag in this case.
The notification-delivery microservices depend on where you want to send the notifications, but again Go or C++ can deliver some of the lowest latencies.
The system's data is its subscriptions; the data can be sharded per location and userId the same way as the rest of the architecture, so we can have a single DB per second-level shard.
For storage of the product fields, depending on how often they change, they can live in the code (presuming they change very rarely or never) or in the DBs, with a synchronisation mechanism between the DBs if they are expected to change more often.
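As a sketch of how a subscription could be assigned to a second-level shard by userId (the hashing scheme and shard count are just assumptions for illustration):

package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor picks a second-level shard for a user within a region, so that
// each shard handles a stable subset of the region's subscriptions while
// still receiving every product event.
func shardFor(userID string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32()) % numShards
}

func main() {
	for _, u := range []string{"user-17", "user-42", "user-99"} {
		fmt.Printf("%s -> shard %d\n", u, shardFor(u, 4))
	}
}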
We are working on an IoT platform which ingests many device parameter values (time series) every second from many devices. Once ingested, each JSON (a batch of multiple parameter values captured at a particular instant) flows through many microservices downstream. What is the best way to track the JSON as it flows through those microservices in an event-driven way?
We use Spring Boot technology predominantly and all the services are containerised.
E.g. Option 1: Is associating a UUID with each object and then updating its state idempotently in Redis as each microservice processes it ideal? The problem is that each microservice will now be tied to Redis, and we have seen the performance of Redis go down as the number of API calls to it increases, since it is single-threaded (we can scale it out, though).
Option 2 - Zipkin?
Note: We use Kafka/RabbitMQ to process the messages in a distributed way, as you mentioned here. My question is about a strategy for tracking each of these messages and their status (to enable replay if needed, so as to achieve exactly-once delivery). Let's say message1 is being processed by Service A, Service B and Service C. We are having trouble tracking whether the message failed to be processed at Service B or at Service C, as we get a lot of messages.
A better approach would be to use Kafka instead of Redis.
Create a topic for every microservice and keep moving the packet from one topic to another after processing:
topic(raw-data) - |MS One| - topic(processed-data-1) - |MS Two| - topic(processed-data-2) ... etc
Keep appending the results to the same object and keep moving it down the line, until every microservice has processed it.
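One way to picture "keep appending the results to the same object" is an envelope that carries a correlation UUID plus a history entry added by each service before it produces to the next topic; the field names below are illustrative, not a standard:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Stage records what one microservice did with the message.
type Stage struct {
	Service     string    `json:"service"`
	Status      string    `json:"status"` // e.g. "ok" or "failed"
	ProcessedAt time.Time `json:"processed_at"`
}

// Envelope travels from topic to topic; each service appends its Stage
// before producing to the next topic, so the trail shows where a message stopped.
type Envelope struct {
	MessageID string          `json:"message_id"` // UUID assigned at ingestion
	Payload   json.RawMessage `json:"payload"`    // the original sensor batch
	History   []Stage         `json:"history"`
}

func main() {
	env := Envelope{
		MessageID: "6f1c1f6e-0000-4000-8000-000000000000", // example UUID
		Payload:   json.RawMessage(`{"temperature": 21.5}`),
	}
	// Service A processed it successfully, Service B failed: the history tells us where.
	env.History = append(env.History, Stage{"service-a", "ok", time.Now()})
	env.History = append(env.History, Stage{"service-b", "failed", time.Now()})

	out, _ := json.MarshalIndent(env, "", "  ")
	fmt.Println(string(out))
}

Whichever consumer sees a gap or a "failed" entry in the history knows where the message stopped, which also gives you a natural replay point.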
Given service A (CMS) that controls a model (Product; let's assume the only fields it has are id, title and price) and services B (Shipping) and C (Emails) that have to display that model, what should the approach be to synchronize the model's information across those services in an event-sourcing approach? Let's assume that the product catalog rarely changes (but does change) and that there are admins who access the data of shipments and emails very often (example functionalities are: B: display the titles of the products an order contained, and C: display the content of the shipping email that is going to be sent). Each of the services has its own DB.
Solution 1
Send all required information about Product within event - this means following structure for order_placed:
{
order_id: [guid],
product: {
id: [guid],
title: 'Foo',
price: 1000
}
}
On services B and C, product information is stored in a product JSON attribute on the orders table.
As such, to display the necessary information, only data retrieved from the event is used.
Problems: depending upon what other information needs to be presented in B and C, the amount of data in the event can grow. B and C might not require the same information about the Product, but the event will have to contain both (unless we separate the events into two). If given data is not present within the event, the code cannot use it: if we add a color option to the Product, then for existing orders in B and C the product will be colorless unless we update the events and then rerun them.
Solution 2
Send only guid of product within event - this means following structure for order_placed:
{
order_id: [guid],
product_id: [guid]
}
On services B and C, product information is stored in a product_id attribute on the orders table.
Product information is retrieved by services B and C when required, by performing an API call to the A/product/[guid] endpoint.
Problems: this makes B and C dependent upon A (at all times). If the schema of Product changes on A, changes have to be made on all services that depend on it (immediately).
Solution 3
Send only guid of product within event - this means following structure for order_placed:
{
order_id: [guid],
product_id: [guid]
}
On services B and C, product information is stored in a products table; there's still a product_id on the orders table, but now product data is replicated between A, B and C; B and C might contain different information about the Product than A does.
Product information is seeded when services B and C are created, and is updated whenever information about Products changes, either by making a call to the A/product endpoint (which returns the required information for all products) or by performing direct DB access to A and copying the product information required by the given service.
Problems: this makes B and C dependent upon A (when seeding). If the schema of Product changes on A, changes have to be made on all services that depend on it (when seeding).
From my understanding, the correct approach would be to go with Solution 1 and either update the event history per certain logic (if the Product catalog hasn't changed and we want to add a color to be displayed, we can safely update the history to get the current state of the Products and fill in the missing data within the events) or cater for the nonexistence of given data (if the Product catalog has changed and we want to add a color to be displayed, we can't be sure whether at that point in the past the given Product had a color or not; we can assume that all Products in the previous catalog were black and cater for that by updating the events or the code).
Solution #3 is really close to the right idea.
A way to think about this: B and C are each caching "local" copies of the data that they need. Messages processed at B (and likewise at C) use the locally cached information. Likewise, reports are produced using the locally cached information.
The data is replicated from the source to the caches via a stable API. B and C don't even need to be using the same API; they use whatever fetch protocol is appropriate for their needs. In effect, we define a contract (protocol and message schema) which constrains the provider and the consumer. Then any consumer for that contract can be connected to any supplier. Backward-incompatible changes require a new contract.
Services choose the appropriate cache invalidation strategy for their needs. This might mean pulling changes from the source on a regular schedule, or in response to a notification that things may have changed, or even "on demand" -- acting as a read through cache, falling back to the stored copy of the data when the source is not available.
This gives you "autonomy", in the sense that B and C can continue to deliver business value when A is temporarily unavailable.
Recommended reading: Data on the Outside, Data on the Inside, Pat Helland 2005.
Generally speaking, I'd strongly recommend against option 2 because of the temporal coupling between those two services (unless communication between these services is super stable and not very frequent). Temporal coupling is what you describe as "this makes B and C dependent upon A (at all times)", and it means that if A is down or unreachable from B or C, B and C cannot fulfill their function.
I personally believe that both options 1 and 3 have situations where they are valid options.
If the volume of communication between A and B & C is high, or the amount of data that needs to go into the event is large enough to make it a concern, then option 3 is the best option, because the burden on the network is much lower, and the latency of operations will decrease as the message size decreases. Other concerns to consider here are:
Stability of contract: if the contract of the message leaving A changed often, then putting a lot of properties in the message would result in lots of changes in consumers. However, in this case I believe this is not a big problem because:
You mentioned that system A is a CMS. This means that you're working in a stable domain, and as such I don't believe you'll be seeing frequent changes.
Since B and C are shipping and email, and you're receiving data from A, I believe you'll be experiencing additive changes instead of breaking ones, which are safe to add whenever you discover them, with no rework.
Coupling: there is very little to no coupling here. Since the communication is via messages, there is no coupling between the services other than a short temporal one during seeding of the data, and the contract of that operation (which is not a coupling you can or should try to avoid).
Option 1 is not something I'd dismiss, though. There is the same amount of coupling, but development-wise it should be easy to do (no need for special actions), and the stability of the domain should mean that these won't change often (as I mentioned already).
Another option I'd suggest is a slight variation on 3: instead of running the seeding process during start-up, have B and C observe "ProductAdded" and "ProductDetailsChanged" events whenever there is a change in the product catalogue in A. This would make your deployments faster (and so make it easier to fix a problem/bug if you find one).
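A sketch of that variation, with B (or C) keeping its own product copy up to date from the catalogue events (the event names come from the paragraph above; storage is just a map here):

package main

import "fmt"

type ProductEvent struct {
	Type  string // "ProductAdded" or "ProductDetailsChanged"
	ID    string
	Title string
	Price int
}

// LocalCatalogue is service B's (or C's) own copy of the product data it needs.
type LocalCatalogue struct {
	products map[string]ProductEvent
}

// Handle upserts the local copy whenever A publishes a catalogue change,
// so no start-up seeding run is required.
func (c *LocalCatalogue) Handle(e ProductEvent) {
	switch e.Type {
	case "ProductAdded", "ProductDetailsChanged":
		c.products[e.ID] = e
	}
}

func main() {
	cat := &LocalCatalogue{products: map[string]ProductEvent{}}
	cat.Handle(ProductEvent{Type: "ProductAdded", ID: "p-1", Title: "Foo", Price: 1000})
	cat.Handle(ProductEvent{Type: "ProductDetailsChanged", ID: "p-1", Title: "Foo", Price: 1200})
	fmt.Println(cat.products["p-1"]) // latest known details
}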
Edit 2020-03-03
I have a specific order of priorities when determining the integration approach:
What is the cost of consistency? Can we accept some milliseconds of inconsistency between things changed in A and them being reflected in B & C?
Do you need point-in-time queries (also called temporal queries)?
Is there any source of truth for the data? A service which owns them and is considered upstream?
If there is an owner / single source of truth is that stable? Or do we expect to see frequent breaking changes?
If the cost of inconsistency is high (basically, the product data in A needs to be consistent as soon as possible with the product data cached in B and C), then you cannot avoid accepting unavailability, and you must make a synchronous request (like a web/REST request) from B & C to A to fetch the data. Be aware! This still does not mean transactional consistency; it just minimizes the window of inconsistency. If you absolutely, positively have to be immediately consistent, you need to rethink your service boundaries. However, I very strongly believe this should not be a problem. From experience, it's actually extremely rare that a company can't accept some seconds of inconsistency, so you shouldn't even need to make synchronous requests.
If you do need point-in-time queries (which I didn't notice in your question and hence didn't include above, maybe wrongly), the cost of maintaining this in downstream services is so high (you'd need to duplicate internal event-projection logic in all downstream services) that it makes the decision clear: you should leave ownership to A and query A ad hoc over a web request (or similar), and A should use event sourcing to retrieve all the events it knew about at the time, project them into the state, and return it. I guess this may be option 2 (if I understood correctly?), but the costs are such that the temporal coupling is better than the maintenance cost of duplicated events and projection logic.
If you don't need point-in-time queries, and there isn't a clear, single owner of the data (which in my initial answer I assumed based on your question), then a very reasonable pattern would be to hold representations of the product in each service separately. When you update the data for products, you update A, B and C in parallel by making parallel web requests to each one, or you have a command API which sends multiple commands to each of A, B and C. B & C use their local version of the data to do their job, which may or may not be stale. This isn't any of the options above (although it could be made to be close to option 3), as the data in A, B and C may differ, and the "whole" of the product may be a composition of all three data sources.
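A sketch of the "update A, B and C in parallel" variant (the service names and the update call are placeholders, not real endpoints):

package main

import (
	"fmt"
	"sync"
)

// updateService stands in for a web request that pushes the new product data
// to one of the services; each service keeps its own representation.
func updateService(name string, productJSON string) error {
	fmt.Printf("PUT %s/products <- %s\n", name, productJSON)
	return nil // the actual network call is omitted in this sketch
}

func main() {
	services := []string{"service-a", "service-b", "service-c"}
	product := `{"id":"p-1","title":"Foo","price":1000}`

	var wg sync.WaitGroup
	errs := make(chan error, len(services))
	for _, svc := range services {
		wg.Add(1)
		go func(svc string) { // fan the update out in parallel
			defer wg.Done()
			if err := updateService(svc, product); err != nil {
				errs <- err
			}
		}(svc)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		fmt.Println("update failed:", err) // a real system would retry or compensate here
	}
}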
Knowing whether the source of truth has a stable contract is useful because you can then use its domain/internal events (or the events you store in A if you use event sourcing as its storage pattern) for integration between A and services B and C. If the contract is stable you can integrate through the domain events. However, you then have an additional concern when changes are frequent, or when the message contract is large enough to make transport a concern.
If you have a clear owner, with a contract that is expected to be stable, the best option would be option 1; an order would contain all the necessary information, and then B and C would do their function using the data in the event.
If the contract is liable to change or break often, then following your option 3, i.e. falling back to web requests to fetch product data, is actually a better option, since it's much easier to maintain multiple versions. So B would make a request against v3 of the product.
There are two hard things in Computer Science, and one of them is cache invalidation.
Solution 2 is absolutely my default position, and you should generally only consider implementing caching if you run into one of the following scenarios:
The API call to Service A is causing performance problems.
The cost of Service A being down and being unable to retrieve the data is significant to the business.
Performance problems are really the main driver. There are many ways of solving #2 that don't involve caching, like ensuring Service A is highly available.
Caching adds significant complexity to a system, and can create edge cases that are hard to reason about, and bugs that are very hard to replicate. You also have to mitigate the risk of providing stale data when newer data exists, which can be much worse from a business perspective than (for example) displaying a message that "Service A is down--please try again later."
From this excellent article by Udi Dahan:
These dependencies creep up on you slowly, tying your shoelaces
together, gradually slowing down the pace of development, undermining
the stability of your codebase where changes to one part of the system
break other parts. It’s a slow death by a thousand cuts, and as a
result nobody is exactly sure what big decision we made that caused
everything to go so bad.
Also, if you need point-in-time querying of product data, this should be handled in the way the data is stored in the Product database (e.g. start/end dates) and should be clearly exposed in the API (an effective date needs to be an input to the API call that queries the data).
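As a sketch of what "effective date as an input" might look like at the storage level (table and column names are assumptions):

package main

import "fmt"

// pointInTimeQuery shows the shape of a temporal lookup: the caller supplies
// the effective date, and the row whose validity window covers it is returned.
const pointInTimeQuery = `
SELECT id, title, price
FROM products
WHERE id = $1
  AND effective_from <= $2
  AND (effective_to IS NULL OR effective_to > $2);`

func main() {
	fmt.Println(pointInTimeQuery) // passed to database/sql with (productID, effectiveDate)
}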
It is very hard to simply say that one solution is better than the other. Choosing between Solutions #2 and #3 depends on other factors (cache duration, consistency tolerance, ...).
My 2 cents:
Cache invalidation might be hard, but the problem statement mentions that the product catalog changes rarely. This makes product data a good candidate for caching.
Solution #1 (NOK)
Data is duplicated across multiple systems
Solution #2 (OK)
Offers strong consistency
Works only when the product service is highly available and offers good performance
If the email service prepares a summary (with a lot of products), the overall response time could be longer
Solution #3 (Complex but preferred)
Prefer an API approach instead of direct DB access to retrieve product information
Resilient: consuming services are not impacted when the product service is down
Consuming applications (shipping and email services) retrieve product details immediately after an event is published. The possibility of the product service going down within those few milliseconds is very remote.
I have a system composed of 3 sensors (temperature, humidity, camera) attached to an Arduino, 1 cloud, and 1 mobile phone. I developed a monolithic IoT application that has different tasks that need to be executed in these three different locations (Arduino, cloud, mobile). All these sensors have common tasks, which are: data detection and data transfer (executed on the Arduino), data saving, data analysis and data notification (on the cloud), and data visualization (on the mobile).
The problem: I know that a microservice is independent and has its own database. How do I transform this application into one using a microservice architecture? The first idea is representing each task as a microservice.
At first, I considered each task as a component and thought of representing each one as a microservice, but they are linked: the output of the previous task is the input of the present one, so I can't split them like this because they aren't independent. Another thing: the data collection microservice should be placed on the Arduino, and the data should be sent to the cloud to be stored there in the database, so here we have a distant DB. For data collection, I have the same idea as you: since there are different things (sensors), there will be different microservices (temperature data collection, camera data collection, ...).
First let me clear up a confusion: when we say microservices are independent, how can we design a microservice in which the output of the previous task is the input of the next one?
When we say microservice, we mean it is independently deployable and manageable, but as in any system there are dependencies, and microservices also depend upon each other. You can read about reactive microservices.
So you can have microservices which depend on one another, but we want these dependencies to be minimal.
Now let's understand the benefits we want from adopting microservices (this will help answer your question):
Independently deployable components (which speed up deployments): in any big application there are components which are relatively independent of each other, and if I want to change something in one component I should be confident another will not be impacted. In a monolith, since everything is in one binary, the impact would be high.
Independently scalable: as different components require different scale, we can have different types of databases and machine requirements.
There are various other benefits, and also some overhead, that a microservice architecture brings (I can't go into detail here; read about these things online).
Now let's discuss the approach.
As data collection is independent of how and what kind of analysis happens on the data, I would have a DataCollectionService in the cloud (it collects data from all sensors; we can create different services for different sensors if their data are completely independent).
DataAnalysis as a separate service (it doesn't need to know a thing about how data is collected, whether via MQTT, websockets, periodically, in batches, or whatever). This service needs data and will act upon it.
Notification Service
DataSendClient on the Arduino: some client which sends data to the DataCollectionService.
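A sketch of the boundary between the DataSendClient and the DataCollectionService, assuming plain HTTP/JSON for the transport (MQTT, websockets or batching would work just as well; the names and fields are illustrative):

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Reading is one batch of parameter values sent by the DataSendClient on the Arduino.
type Reading struct {
	DeviceID    string  `json:"device_id"`
	Temperature float64 `json:"temperature"`
	Humidity    float64 `json:"humidity"`
}

// readings stands in for the hand-off to the DataAnalysis service
// (a queue or topic in a real deployment).
var readings = make(chan Reading, 100)

// ingest is the DataCollectionService endpoint: it only collects and forwards;
// analysis and notification live in their own services.
func ingest(w http.ResponseWriter, r *http.Request) {
	var rd Reading
	if err := json.NewDecoder(r.Body).Decode(&rd); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	readings <- rd
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	go func() { // placeholder consumer; in reality this would publish to DataAnalysis
		for rd := range readings {
			log.Printf("forwarding reading from %s to DataAnalysis", rd.DeviceID)
		}
	}()
	http.HandleFunc("/readings", ingest)
	log.Fatal(http.ListenAndServe(":8080", nil))
}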
Two General Problems - EventStore and persistence layer?
I would like to understand how the industry is actually dealing with these problems!
Microservice 1 persists object X into Database A. At the same time, for microservice 2 to feed on the data from microservice 1, microservice 1 writes the same object X to an event store B.
Now, the question I have is: where do I write object X first?
Database A first and then event store B? Is it fair to roll back the thread at the app level if Database A is down? Also, what should the error handling be if Database A is online and persisted object X but event store B is down?
What should the error handling look like if we go the other way around from point 1?
I do understand that in today's world of distributed, highly available systems, systems going down is unlikely. But it can happen. I want to understand what needs to be done when either the database or the event store system/cluster is down.
In general you want to avoid relying on a two-phase commit of the kind you describe.
In general (presuming an event-sourced system; not sure if that's implicit in your question or an option for you; perhaps SqlStreamStore might be relevant in your context?), this is typically managed by having consumers project from a single authoritative set of events on a pull basis: each downstream consumer that needs to take an action in response to events maintains a pointer to how far it has got in projecting events from the base stream, and restarts from there if interrupted.
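A sketch of that pull-based projection with a checkpoint (the event source and checkpoint handling are abstracted; this is not any specific product's API):

package main

import "fmt"

type Event struct {
	Position int
	Data     string
}

// project applies events from an authoritative stream, remembering how far it
// has got; after a crash it resumes from the stored checkpoint instead of
// relying on a two-phase commit between the event store and the read database.
func project(stream []Event, checkpoint *int, apply func(Event)) {
	for _, e := range stream {
		if e.Position <= *checkpoint {
			continue // already projected before the interruption
		}
		apply(e)                 // update the downstream database / read model
		*checkpoint = e.Position // in practice, persist this together with the update
	}
}

func main() {
	stream := []Event{{1, "created"}, {2, "renamed"}, {3, "archived"}}
	checkpoint := 1 // we had projected up to position 1 before being interrupted
	project(stream, &checkpoint, func(e Event) { fmt.Println("applying", e.Position, e.Data) })
	fmt.Println("checkpoint now at", checkpoint)
}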
First of all, an event store is a type of persistence which stores the application's state as a series of events, as opposed to a flat persistence that stores the last projected state.
Microservice 1 persists object X into Database A. At the same time, for microservice 2 to feed on the data from microservice 1, microservice 1 writes the same object X to an event store B.
You are trying to have two sources of truth that must be kept in sync by some sort of distributed transaction, which is not very scalable.
This is an unusual way of using an event store. In general an event store is the canonical source of information, the single source of truth. You are trying to use it as a communication channel. The event store is the persistence of an event-sourced Aggregate (see Domain-Driven Design).
I see two options:
You could refactor your architecture and make object X an event-sourced entity whose persistence is the event store. Then have a read model subscribe to the event store and build a flat representation of object X that is persisted in database A. In other words, write first to the event store and then to database A (but in an eventually consistent manner!). This is a big jump, and you should really think about whether you want to go event-sourced.
You could use CQRS without event sourcing. This means that after every modification, object X emits one or more domain events, which are persisted in database A in the same local transaction as object X itself. Microservice 2 could then subscribe to database A to get the emitted events. How the subscription works depends on the type of database.
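A sketch of the key step in that second option: persisting the object and its emitted domain events in the same local transaction (table names, SQL and placeholders are assumptions, written against database/sql so it works with any driver):

package outbox

import (
	"database/sql"
	"encoding/json"
)

// SaveWithEvents writes object X and the domain events it emitted in one local
// transaction, so a consumer can later read the events from database A without
// any distributed transaction between a database and a separate event store.
func SaveWithEvents(db *sql.DB, id string, state []byte, events []any) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	// Persist the current state of object X (Postgres-style placeholders assumed).
	if _, err := tx.Exec(`UPDATE objects SET state = $1 WHERE id = $2`, state, id); err != nil {
		return err
	}
	// Persist the emitted domain events in the same transaction.
	for _, e := range events {
		payload, err := json.Marshal(e)
		if err != nil {
			return err
		}
		if _, err := tx.Exec(`INSERT INTO domain_events (object_id, payload) VALUES ($1, $2)`, id, payload); err != nil {
			return err
		}
	}
	return tx.Commit()
}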
I have a feeling you are using the event store as a channel of communication instead of using it as a database. If you want microservice 2 to feed on the data from microservice 1, then you should communicate via REST services.
Of course, relying on REST services might make you less resilient to outages. In that case, using a piece of technology dedicated to communication would be the right way to go. (I'm thinking MQ/Topics, such as RabbitMQ, Kafka, etc.)
Then, once your services are talking to each other, you will still need to persist your data... but only at one single location.
Therefore, you will need to define where you want to store the data.
Ask yourself:
Who will have the governance of the data persistence?
Is it Microservice1? If so, then every time Microservice2 needs to read the data, it will make a REST call to Microservice1.
Or is it the other way around? Microservice2 has the governance of the data, and Microservice1 consumes it?
It could be a third microservice that you haven't even created yet. It depends how you applied your separation of concerns.
Let's take an example :
Microservice1's responsibility is to process our data to export them in PDF and other formats
Microservice2's responsibility is to expose a service for a legacy partner, that requires our data to be returned in a very proprietary representation.
who is going to store the data, here ?
Microservice1 should not be the one to persist the data : its job is only to convert the data to other formats. If it requires some data, it will fetch them from the one having the governance of the data.
Microservice2 should not be the one to persist the data. After all, maybe we have a number of other Microservices similar to this one, but for other partners, with different proprietary formats.
If there is a service where you can do CRUD operations, this is your guy. If you don't have such a service, maybe you can find an existing microservice that wouldn't have conflicting responsibilities.
For instance: if I have a Microservice3 that makes sure that every time my ObjectX is changed, a PDF representation of it is sent to some address and all my partners are notified that the data is out of date, then this microservice looks like a good candidate to become the "governor of the data" for this part of the domain, and to be the one-stop shop for writing/reading in the database.