Documenting the aggregates & relations between microservices

In my organization we're trying to design our microservices based on the Bounded Context (BC) pattern (part of Domain-Driven Design). While we're doing this, we also try to use another DDD pattern called Context Mapping, to better identify the various contexts in the application, their boundaries and the relations between them.
All of this can be done on a whiteboard or in some online drawing tool. However, I'm looking for a way to generate a complete picture of the various services, what aggregates they contain and potentially the relations between those aggregates (as the same User in one BC might be a Customer in another). A good example is figure 4-10 here. The generation should ideally be based on some DSL or script which we would maintain, as this kind of work is fairly high-level and context boundaries don't change very often. For example, when a team adds a new aggregate or starts keeping a copy of an aggregate from another service, they update the script/DSL and regenerate the diagram.
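For illustration, here is a minimal sketch (not tied to any particular tool) of the kind of maintained script I have in mind: the bounded contexts, their aggregates and the cross-context relations live in plain Python data, and the graphviz package renders the picture. All context and aggregate names below are hypothetical.
```python
# Sketch: regenerate a context/aggregate diagram from hand-maintained Python data.
from graphviz import Digraph

contexts = {
    "Sales": ["Order", "Customer"],
    "Identity": ["User"],
    "Shipping": ["Shipment", "Customer"],  # local copy of Customer
}
# (source context, aggregate) -> (target context, aggregate)
relations = [
    (("Identity", "User"), ("Sales", "Customer")),
    (("Sales", "Customer"), ("Shipping", "Customer")),
]

g = Digraph("context_map", graph_attr={"rankdir": "LR"})
for ctx, aggregates in contexts.items():
    with g.subgraph(name=f"cluster_{ctx}") as c:   # one box per bounded context
        c.attr(label=ctx, style="rounded")
        for agg in aggregates:
            c.node(f"{ctx}.{agg}", label=agg, shape="box")
for (src_ctx, src_agg), (dst_ctx, dst_agg) in relations:
    g.edge(f"{src_ctx}.{src_agg}", f"{dst_ctx}.{dst_agg}", style="dashed")

g.render("context_map", format="svg", cleanup=True)  # regenerate on every change
```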
Solutions I've looked at so far:
Context Mapper - it doesn't visualise the aggregates in each BC/service, nor does it show relations
C4 model, Level 2 - we already use it, so it would be fairly easy to add a textual list of aggregates per container, but that's not what it's intended for (and the visualisation is not optimal)
DDD bounded context / microservice canvas - it's too detailed and can't really be used to look at the big picture
I'm wondering how, and if, this is done in other organizations, and I'm looking for suggestions for tooling that would help.

I think the format used for event storming sessions might be worth a look in your case. Once done, it covers all domain events, commands, actors, read models, policies and external systems. It also illustrates the bounded contexts in which the aggregates, events, etc. live. An example can be found here:
https://medium.com/capital-one-tech/event-storming-decomposing-the-monolith-to-kick-start-your-microservice-architecture-acb8695a6e61
I know the format is mainly used for domain exploration, but from my experience, if done nicely (e.g. using some tool like Miro, Lucid, or the like) it also provides good documentation and a good overview of what's going on in your system.

Related

Kedro Data Modelling

We are struggling to model our data correctly for use in Kedro - we are using the recommended Raw/Int/Prm/Ft/Mst layering but are having trouble with some of the concepts, e.g.:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
Is it OK for a primary dataset to consume data from another primary dataset?
Is it good practice to build a feature dataset from the Int layer, or should it always pass through Primary?
I appreciate there are no hard and fast rules with data modelling, but these are big modelling decisions and any guidance or best practice on Kedro modelling would be really helpful; I can find just one table defining the layers in the Kedro docs.
If anyone can offer any further advice or blogs/docs talking about Kedro data modelling, that would be awesome!
Great question. As you say, there are no hard and fast rules here and opinions do vary, but let me share my perspective as a QB data scientist and kedro maintainer who has used the layering convention you referred to several times.
For a start, let me emphasise that there's absolutely no reason to stick to the data engineering convention suggested by kedro if it's not suitable for your needs. 99% of users don't change the folder structure in data - not because the kedro default is the right structure for them, but because they just don't think of changing it. You should absolutely add/remove/rename layers to suit yourself. The most important thing is to choose a set of layers (or even a non-layered structure) that works for your project, rather than trying to shoehorn your datasets into the kedro default suggestion.
Now, assuming you are following kedro's suggested structure - onto your questions:
When is a dataset a feature rather than a primary dataset? The distinction seems vague...
In the case of simple features, a feature dataset can be very similar to a primary one. The distinction is maybe clearest if you think about more complex features, e.g. ones formed by aggregating over time windows. A primary dataset would have a column that gives a cleaned version of the original data, without any complex calculations on it - just simple transformations. Say the raw data is the colour of all cars driving past your house over a week. By the time the data is in primary, it will be clean (e.g. correcting "rde" to "red", maybe mapping "crimson" and "red" to the same colour). Between primary and the feature layer, we will have done some less trivial calculations on it, e.g. to find the one-hot encoded most common car colour for each day.
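To make the distinction concrete, here is a toy pandas sketch of that car-colour example (the column names are made up): the primary step only cleans, the feature step does the aggregation.
```python
import pandas as pd

raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-02 08:00", "2023-01-02 09:30", "2023-01-03 08:15"]),
    "colour": ["rde", "crimson", "red"],
})

# raw/intermediate -> primary: simple cleaning only
colour_map = {"rde": "red", "crimson": "red"}
prm_cars = raw.assign(colour=raw["colour"].replace(colour_map))

# primary -> feature: a less trivial calculation
# (most common colour per day, one-hot encoded)
daily_mode = (
    prm_cars.groupby(prm_cars["timestamp"].dt.date)["colour"]
    .agg(lambda s: s.mode().iloc[0])
)
ft_cars = pd.get_dummies(daily_mode, prefix="most_common_colour")
```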
Is it OK for a primary dataset to consume data from another primary dataset?
In my opinion, yes. This might be necessary if you want to join multiple primary tables together. In general if you are building complex pipelines it will become very difficult if you don't allow this. e.g. in the feature layer I might want to form a dataset containing composite_feature = feature_1 * feature_2 from the two inputs feature_1 and feature_2. There's no way of doing this without having multiple sub-layers within the feature layer.
However, something that is generally worth avoiding is a node that consumes data from many different layers. e.g. a node that takes in one dataset from the feature layer and one from the intermediate layer. This seems a bit strange (why has the latter dataset not passed through the feature layer?).
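As an illustration of the layer discipline described above, here is a hedged sketch using kedro's node/pipeline API; the dataset names, stub functions and the int_/prm_/ft_ naming are assumptions for the example (in a real project the layers would also be recorded in the data catalog).
```python
from kedro.pipeline import Pipeline, node


def clean_cars(int_cars):                        # int -> prm
    ...

def join_cars_weather(prm_cars, prm_weather):    # prm + prm -> prm (same layer in: fine)
    ...

def build_colour_features(prm_cars_weather):     # prm -> ft
    ...


feature_pipeline = Pipeline([
    node(clean_cars, inputs="int_cars", outputs="prm_cars"),
    node(join_cars_weather, inputs=["prm_cars", "prm_weather"], outputs="prm_cars_weather"),
    node(build_colour_features, inputs="prm_cars_weather", outputs="ft_car_colours"),
    # avoid: node(some_func, inputs=["ft_something", "int_something"], ...)  # mixed layers in
])
```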
Is it good practice to build a feature dataset from the INT layer? or should it always pass through Primary?
Building features from the intermediate layer isn't unheard of, but it seems a bit weird. The primary layer is typically an important one, forming the basis for all feature engineering. If your data is already in a shape where you can build features from it, then it's probably primary-layer data already - in which case, maybe you don't need an intermediate layer.
The above points might be summarised by the following rules (which should no doubt be broken when required):
The input datasets for a node should all come from a single layer, say layer L
The output datasets for that node should also all be in a single layer, which can be either L or L+1
If anyone can offer any further advice or blogs/docs talking about Kedro data modelling, that would be awesome!
I'm also interested in seeing what others think here! One possibly useful thing to note is that kedro was inspired by cookiecutter data science, and the kedro layer structure is an extended version of what's suggested there. Maybe other projects have taken this directory structure and adapted it in different ways.
Your question prompted us to write a Medium article better explaining these concepts; it has just been published on Towards Data Science.

Differences between OT and CRDT

Can someone explain to me, simply, the main differences between Operational Transformation (OT) and CRDTs?
As far as I understand, both are approaches that permit data to converge without conflict across different nodes of a distributed system.
In which use cases would you use which approach?
As far as I understand, OT is mostly used for text and CRDTs are more general and can handle more advanced structures, right?
Is CRDT more powerful than OT?
I ask this question because I am trying to see how to implement a collaborative editor for HTML documents, and I'm not sure in which direction to look first. I saw the ShareJS project and its attempts to support rich-text collaboration in the browser on contenteditable elements. Nowhere in ShareJS do I see any attempt to use CRDTs for that.
We also know that Google Docs uses OT, and it works pretty well for real-time editing of rich documents.
Did Google choose OT because CRDTs were not well known at the time, or would OT still be a good choice today?
I'm also interested to hear about other use cases, like using these algorithms on databases. Riak seems to use CRDTs. Can OT also be used to sync nodes of a database, as an alternative to Paxos/Zab/Raft?
Both approaches are similar in that they provide eventual consistency. The difference is in how they do it. One way of looking at it is:
OT does it by changing operations. Operations are sent over the wire and concurrent operations are transformed once they are received.
CRDTs do it by changing state. Operations are made on the local CRDT. Its state is sent over the wire and is merged with the state of a copy. It doesn't matter how many times or in what order merges are made - all copies converge.
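To make the state-based CRDT description concrete, here is a toy grow-only counter in Python; it is only an illustration of the merge idea (commutative, order-insensitive), not something suited to text editing.
```python
class GCounter:
    """A grow-only counter: each replica tracks per-node counts; the dict is the state."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, n=1):          # local operation
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):            # commutative, associative, idempotent
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)   # merge in either order, any number of times...
b.merge(a)
assert a.value() == b.value() == 5   # ...and all replicas converge
```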
You're right that OT is mostly used for text and does predate CRDTs, but research shows that "many OT algorithms in the literature do not satisfy convergence properties, unlike what was stated by their authors".
In other words, CRDT merging is commutative, while OT transformation functions sometimes are not.
From the Wikipedia article on CRDTs: "OTs are generally complex and non-scalable".
There are different kinds of CRDTs (sets, counters, ...) suited for different kinds of problems. There are some that are designed for text editing. For example, Treedoc - A commutative replicated data type for cooperative editing.
Another notable difference is that:
OT requires a central server for co-ordination.
CRDTs can adopt any network topology, like P2P over WebRTC, and they are resilient to network partitions, which makes them suitable for decentralized systems.
Reference: https://youtu.be/B5NULPSiOGw?t=643 by Martin Kleppmann, author of "Designing Data-Intensive Applications".

Prioritizing text based on content

If you have a list of texts and a person interested in certain topics, what algorithms deal with choosing the most relevant texts for that person?
I believe this is quite a complex topic; as an answer I expect a few directions to study, covering various methodologies of text analysis, text statistics, artificial intelligence, etc.
Thank you.
There are quite a few algorithms out there for this task - far too many to mention them all here. First, some starting points:
Topic discovery and recommendation are two quite distinct tasks, although they often overlap. If you have a stable user base, you might be able to give very good recommendations without any topic discovery.
Discovering topics and assigning names to them are also two different tasks. This means it is often easier to tell that text A and text B share a similar topic than to explicitly state what this common topic might be. Giving names to the topics is best done by humans, for example by having them tag the items.
Now to some actual examples.
TF-IDF is often a good starting point; however, it also has severe drawbacks. For example, it will not be able to tell that "car" and "truck" appearing in two texts means the two probably share a topic (see the sketch after these examples).
http://websom.hut.fi/websom/ A Kohonen map for automatically clustering data. It learns the topics and then organizes the texts by topics.
http://de.wikipedia.org/wiki/Latent_Semantic_Analysis Latent Semantic Analysis will be able to boost TF-IDF by detecting semantic similarity among different words. Also note that this has been patented, so you might not be able to use it.
Once you have a set of topics assigned by users or experts, you can also try almost any kind of machine learning method (for example SVM) to map the TF-IDF data to topics.
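As a concrete illustration of the TF-IDF starting point (and its drawback), here is a minimal sketch using scikit-learn; the texts and the user's interest profile are made up.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "New electric car models announced at the auto show",
    "Recipe: slow-cooked vegetable stew",
    "Truck manufacturers invest in battery technology",
]
user_interests = "cars and electric vehicles"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(texts)        # one TF-IDF vector per text
user_vector = vectorizer.transform([user_interests]) # user profile in the same space

scores = cosine_similarity(user_vector, doc_vectors).ravel()
ranked = sorted(zip(scores, texts), reverse=True)    # most relevant texts first
# Note the drawback mentioned above: the "truck" text scores zero for a "car"
# interest profile, even though it is probably relevant.
```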
As a search engine engineer, I think this problem is best solved using two techniques in conjunction.
Technology 1: search (TF-IDF or other algorithms)
Use search to create a baseline model for content where you don't have user statistics. There are a number of technologies out there, but I think the Apache Lucene/Solr code base is by far the most mature and stable.
Technology 2: user-based recommenders (k-nearest neighbours or other algorithms)
When you start getting user statistics, use them to enhance the relevance model used by the text analysis system. A fast-growing codebase for solving these kinds of problems is the Apache Mahout project.
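For the user-based recommender side, here is a toy sketch of the k-nearest-neighbour idea with a hand-written cosine similarity; the ratings matrix is invented, and a real system would use something like Mahout rather than this.
```python
import numpy as np

# rows = users, columns = items (e.g. texts); 0 = not seen/rated
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

target = 0                                    # recommend for the first user
sims = np.array([cosine(ratings[target], r) for r in ratings])
sims[target] = 0                              # ignore self-similarity

# weighted average of other users' ratings, then keep only unseen items
scores = sims @ ratings / (sims.sum() + 1e-9)
unseen = ratings[target] == 0
recommended = np.argsort(-scores * unseen)    # best unseen items first
```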
Check out Programming Collective Intelligence, a really good overview of various techniques along these lines. Also very readable.

Algorithm to handle data aggregation from multiple error-prone sources

I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "latent variables", because then you can use the EM algorithm (http://en.wikipedia.org/wiki/Em_algorithm) to hill-climb to promising solutions for this problem from random starts.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving the listing sources parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
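Here is a rough sketch of the latent-variable/EM core of the steps above, with a deliberately simplified model: each candidate event is latently "true" or "bogus", and each source lists it with a probability that depends on that hidden label. The reports matrix and all numbers are made up.
```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 50 candidate events x 3 sources, 1 = the source lists the event
reports = rng.integers(0, 2, size=(50, 3)).astype(float)

n_sources = reports.shape[1]
pi = 0.5                              # P(an event is true)
p_true = np.full(n_sources, 0.7)      # P(source lists event | event is true)
p_bogus = np.full(n_sources, 0.3)     # P(source lists event | event is bogus)

def bernoulli_lik(x, p):
    """Per-event likelihood of the observed reports under per-source probabilities p."""
    return np.prod(p ** x * (1 - p) ** (1 - x), axis=1)

for _ in range(100):
    # E-step: responsibility that each event is true, given all sources' reports
    lik_true = pi * bernoulli_lik(reports, p_true)
    lik_bogus = (1 - pi) * bernoulli_lik(reports, p_bogus)
    r = lik_true / (lik_true + lik_bogus)

    # M-step: re-estimate the prior and the per-source reporting probabilities
    pi = r.mean()
    p_true = np.clip((r[:, None] * reports).sum(axis=0) / r.sum(), 1e-3, 1 - 1e-3)
    p_bogus = np.clip(((1 - r)[:, None] * reports).sum(axis=0) / (1 - r).sum(), 1e-3, 1 - 1e-3)

probably_true = r > 0.9   # events to trust under the fitted model
```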
I believe the term you are looking for is Record Linkage - "the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area)".
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
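As a small illustration of the matching side of record linkage, here is a sketch that scores candidate pairs of listings on date equality and fuzzy city/venue similarity using Python's difflib; the field names, weights and threshold are invented, and real record-linkage methods weight the comparisons statistically rather than by hand.
```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Fuzzy string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(listing_a, listing_b):
    score = 0.0
    if listing_a["date"] == listing_b["date"]:
        score += 0.5
    score += 0.3 * similarity(listing_a["city"], listing_b["city"])
    score += 0.2 * similarity(listing_a["venue"], listing_b["venue"])
    return score

a = {"date": "2011-08-04", "city": "Austin, TX", "venue": "Stubb's BBQ"}
b = {"date": "2011-08-04", "city": "Austin TX", "venue": "Stubbs Bar-B-Q"}
is_same_event = match_score(a, b) > 0.8   # treat as the same concert above a threshold
```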
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event details being correct. As you encounter sources, adjust the probabilities accordingly. There's a lot for you to consider though:
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data, or to data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August" and an unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but keep a list of both dates with whatever probabilities you think appropriate...)
beware of assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers, etc. - the more confident you are that the same things are referenced, the more the data can be combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, then tweak and repeat (perhaps automatically)
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
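As a toy illustration of the float-probability idea above, here is a sketch that updates the belief that an event is real as sources confirm or fail to confirm it; the reliability numbers are invented, and treating sources as independent is itself an assumption (see the point about sources copying from each other).
```python
def update_belief(prior, source_reliability, confirms=True):
    """Simple Bayes update with a symmetric error model for one source."""
    likelihood_true = source_reliability if confirms else 1 - source_reliability
    likelihood_false = 1 - source_reliability if confirms else source_reliability
    numerator = likelihood_true * prior
    return numerator / (numerator + likelihood_false * (1 - prior))

belief = 0.5                                          # prior: no idea if the concert is real
belief = update_belief(belief, 0.95)                  # accurate curated source lists it
belief = update_belief(belief, 0.70)                  # user-generated source lists it too
belief = update_belief(belief, 0.70, confirms=False)  # a third source doesn't have it
```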
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.

Mixing logic and graphics in a class with MVC pattern

I am currently developing a charting application (for the iPhone, although that is largely irrelevant) using the MVC pattern.
One aspect of the application is that you can overlay a number of statistics on the charts. I am a little unsure how I am going to structure these classes.
For each statistic there will be two aspects.
1. The calculation. The function which will take the data and calculate the relevant statistical figures.
2. The display. The statistics then need to be drawn over the top of the graph.
Obviously I want the code to comply with the MVC pattern as closely as possible, but I am planning to develop possibly hundreds of these statistics.
I could create three classes: one for the graphics, one for the logic, and a factory class to tie the two together. This would fit the pattern, but it seems a huge extra overhead in terms of the number of classes in the system, and additional complexity which I don't feel is necessary.
So, I am very tempted to create a single class for each statistic. But that would mean each class would have logic and graphics mixed in together, which is heavily frowned upon.
Are there any other suggestions as to how I can lay these out in a structured, reusable way without adding unnecessary complexity?
EDIT
Thanks for the answers. Most useful, but they have raised more questions!
MVC does fit the rest of the application perfectly. Also, as it's for the iPhone, I seem to be pushed down this path anyway. This is the only reason I am considering MVC for these statistics.
However, for these statistics, the user will not interact with them; they are purely for display. The statistics are painted as various lines and symbols directly onto the view canvas. Each statistic paints its information in its own way. There is very little that can be shared between them, and each piece of data can only be usefully represented in one way. I can think of no other useful way that I would want to represent the information.
So it seems MVC is out for these, but I am now unsure what pattern would fit other than my newly invented "mix logic and graphics" pattern, which just feels wrong due to the Single Responsibility Principle (thanks for that link).
First of all, will the user interact with the statistics? If not, then you don't need MVC. (The Controller in MVC deals with user interaction).
You want to keep the number of classes to a minimum, which is good. Let's consider both the calculation and display separately.
How will you be displaying the statistics? Will they typically be text labels, or will there be other graph elements (error bars or things like that)? Try to figure out the different ways you will want to display your statistics.
How many different calculations will you have? Does each calculation map directly to a single graph element, or might it be drawn in a number of different ways? Try to figure out how the calculations relate to the graph elements.
As a concrete example, suppose you have a set of data points that you have plotted. You want to display the mean, the median, and the mode. You could display each of these as separate horizontal lines that cut through the chart at the appropriate Y value. The calculations are all independent, but the display logic can be shared. Or, perhaps you want to display the mean as both a line and as a text label. Here, there is only one calculation, but two different display methods.
MVC design is about separating the underlying data from its presentation. By doing this, you can re-use bits of presentation logic for many different pieces of data. Also, you can display a single piece of data in multiple ways, and they will all stay in sync.
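To show that the calculation/display split need not mean three classes per statistic, here is a language-agnostic sketch (written in Python for brevity): each statistic is a small calculation object, and a handful of reusable renderers do the drawing. The names and the canvas interface are hypothetical; the point is that hundreds of statistics can share a few renderers.
```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Callable, Sequence


@dataclass
class Statistic:
    name: str
    compute: Callable[[Sequence[float]], float]   # pure calculation, no drawing


@dataclass
class HorizontalLineRenderer:                     # pure drawing, no calculation
    colour: str

    def draw(self, canvas, statistic, data):
        y = statistic.compute(data)
        canvas.draw_horizontal_line(y=y, colour=self.colour, label=statistic.name)


overlays = [
    (Statistic("mean", mean), HorizontalLineRenderer("red")),
    (Statistic("median", median), HorizontalLineRenderer("blue")),
]
# when the chart redraws:
#   for statistic, renderer in overlays:
#       renderer.draw(canvas, statistic, data_points)
```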
It depends on what you call complex. Most people consider methods or classes that are responsible for multiple things complex, and classes or methods that are responsible for one thing simple. This is also known as the Single Responsibility Principle (SRP).
When you use MVC, the view contains the display logic, the model contains the business logic (the calculation in your case), and the controller ties them together. There are different ways to implement MVC. Complex applications need to separate the domain model and map it to a view model, but most simple applications can use one model.
If you define complexity in a different way, MVC might not be what you like, and you should try a different approach.
