A domain model defines, among other things, the relationships between entities, and we define aggregate roots to provide encapsulation and transaction boundaries. The well-known relationships are one-to-one (an entity or value object contained within an aggregate root), one-to-many (an aggregate root containing a collection of child objects) and many-to-many. The latter are difficult, because many-to-many relationships between aggregate roots get you in trouble with the transaction boundary. Thus, in many cases, one direction of the many-to-many relationship is seen as more important and only that direction is modeled, as a one-to-many relation.
Now, take it one step further. Networks. Many-to-many relationships between equivalent partners. How can you model that without violating the transaction boundary on your aggregate roots?
Have a look at this widely-applicable example:
I have a network with nodes. Every node has a limited number of ports. One port can only be connected to one port on another node. I have to be able to add and remove connections between nodes, using the ports.
An intuitive approach would be to model the nodes as aggregate roots containing ports. Connections seem to be value objects, and one port can have one connection. I could implement a Node.ConnectTo(nodeId, portId) method which adds the connection (between port X on node A and port Y on node B) to the aggregate root, node A. Preferably, I would call this method twice, once on node A and once on node B, and wrap both calls in a transaction. However, that would violate the transaction boundary, so I decide to only store the connection on node A.
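To make this concrete, here is a minimal sketch of that intuitive model; the class and method names (Connection, Node.connectTo) are hypothetical and just follow the description above:

    // Hypothetical sketch of the "intuitive" model: a Node aggregate owning its
    // ports, with connectTo recording the connection only on this node.
    import java.util.HashMap;
    import java.util.Map;

    class Connection {               // value object: the far end of the link
        final String remoteNodeId;
        final String remotePortId;
        Connection(String remoteNodeId, String remotePortId) {
            this.remoteNodeId = remoteNodeId;
            this.remotePortId = remotePortId;
        }
    }

    class Node {                     // aggregate root
        private final String id;
        private final Map<String, Connection> portConnections = new HashMap<>();

        Node(String id, int portCount) {
            this.id = id;
            for (int i = 1; i <= portCount; i++) {
                portConnections.put("port" + i, null);   // null = free port
            }
        }

        // Records the connection on THIS aggregate only; node B is not touched,
        // so only one aggregate is modified per transaction.
        void connectTo(String localPortId, String remoteNodeId, String remotePortId) {
            if (!portConnections.containsKey(localPortId)) {
                throw new IllegalArgumentException("unknown port " + localPortId);
            }
            if (portConnections.get(localPortId) != null) {
                throw new IllegalStateException(localPortId + " is already connected");
            }
            portConnections.put(localPortId, new Connection(remoteNodeId, remotePortId));
        }
    }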
To see the connection on node B in the application client, a separate read model would be needed. But that's no problem; the CQRS architecture gives us exactly that. So adding, removing and viewing connections is not a problem.
The problem arises when I want to validate whether a port is still free before adding a connection to it. Because we respect the transaction boundary, the fact that a port is already connected might not be known (in the write model) to the aggregate root we are writing to; it might be stored in any other aggregate root.
Of course, you could trust your client's validation, go ahead and add the connection if it's OK for the node you are adding it to, and rely on a process running consistency checks to execute compensating actions for invalid connections. But that seems like a big deal to me compared to wrapping a transaction around two ConnectTo calls...
This made me think that maybe my aggregate roots were chosen incorrectly, and I started thinking about Nodes and Networks as aggregate roots, where a Network is a collection of Connections. The good thing about a Network aggregate is that you could always validate adding or removing connections. Except when a new connection would result in joining two existing networks... And your aggregate could become big, possibly degenerating into a single huge network. Not feasible either.
So, how do you think this should be modeled? Do you see a solution where you respect aggregate roots as transaction boundaries, you can validate your network, and you do not risk storing your entire network as a single aggregate? Or am I asking for all three CAP guarantees here, and is it simply not possible?
I think your "new way" is flawed, since the view model should not produce an exception that propagates "somehow" back to the domain model. The domain model needs to resolve this by itself.
So, in this case (a 1-to-1 binding) you could utilize events within the domain model (a rough code sketch follows the steps), so that
NodeA.connect( "port1" ).to( NodeB ).on( "port3" );
NodeA reserves "port1" on itself.
NodeA sends a "portConnectionRequest" to NodeB.
NodeB binds "port3" if available.
NodeB sends "portConnectionConfirmed" or "portConnectionDenied".
NodeA receives the event and acts accordingly.
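A rough in-process sketch of that exchange, with direct method calls standing in for the domain events to keep it short; all names (requestConnection, onConnectionRequest, and so on) are purely illustrative:

    import java.util.HashMap;
    import java.util.Map;

    class ConnectSaga {
        enum PortState { FREE, RESERVED, CONNECTED }

        static class Node {
            final String id;
            final Map<String, PortState> ports = new HashMap<>();

            Node(String id, String... portIds) {
                this.id = id;
                for (String p : portIds) ports.put(p, PortState.FREE);
            }

            // Steps 1+2: reserve the local port, then ask the other node for its port.
            void requestConnection(String localPort, Node other, String remotePort) {
                ports.put(localPort, PortState.RESERVED);
                other.onConnectionRequest(this, localPort, remotePort);
            }

            // Steps 3+4: bind if free, answer with confirmed/denied.
            void onConnectionRequest(Node requester, String requesterPort, String myPort) {
                if (ports.get(myPort) == PortState.FREE) {
                    ports.put(myPort, PortState.CONNECTED);
                    requester.onConfirmed(requesterPort);
                } else {
                    requester.onDenied(requesterPort);
                }
            }

            // Step 5: act on the reply -- commit or roll back the local reservation.
            void onConfirmed(String localPort) { ports.put(localPort, PortState.CONNECTED); }
            void onDenied(String localPort)    { ports.put(localPort, PortState.FREE); }
        }

        public static void main(String[] args) {
            Node a = new Node("NodeA", "port1");
            Node b = new Node("NodeB", "port3");
            a.requestConnection("port1", b, "port3");
            System.out.println(a.ports + " / " + b.ports);   // both ports end up CONNECTED
        }
    }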
The above assumes reliable messaging, which is easily achieved within the JVM but much harder in a distributed environment, yet that is exactly where you want it most. If a reliable messaging system cannot be provided, I think you will have a Byzantine Agreement problem at hand, or a subset of it.
Ok, I read and thought some more about it, and I guess this is the 'right' way to do it (a small code sketch follows the steps):
Before executing the ConnectTo method on node A, you validate whether the port on node B is still free, using an eventually consistent view model as your data source (not the domain model, which cannot validate this efficiently; see above).
ConnectTo is run only on Node A, thus no transaction boundary is violated.
If the port on node B is in fact already in use, a true concurrency conflict has happened and it must be signaled. Some action needs to be taken (either manual intervention or an automated process must pick it up). The probability of this concurrency conflict will usually be very low.
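A minimal sketch of that flow, reusing the Node sketch from earlier; PortReadModel and NodeRepository are hypothetical application-layer interfaces, not an existing API:

    // Application service: check the eventually consistent read model, then touch
    // only the single Node aggregate inside the transaction.
    interface PortReadModel {
        boolean isPortFree(String nodeId, String portId);   // eventually consistent view
    }

    interface NodeRepository {
        Node load(String nodeId);
        void save(Node node);
    }

    class ConnectNodesHandler {
        private final PortReadModel readModel;
        private final NodeRepository nodes;

        ConnectNodesHandler(PortReadModel readModel, NodeRepository nodes) {
            this.readModel = readModel;
            this.nodes = nodes;
        }

        void handle(String nodeAId, String portAId, String nodeBId, String portBId) {
            // The view model may be stale; that rare case is the concurrency
            // conflict that must be signaled and compensated later.
            if (!readModel.isPortFree(nodeBId, portBId)) {
                throw new IllegalStateException(portBId + " on " + nodeBId + " is already in use");
            }
            Node nodeA = nodes.load(nodeAId);
            nodeA.connectTo(portAId, nodeBId, portBId);      // single aggregate, single transaction
            nodes.save(nodeA);
        }
    }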
I have been reading Nathan Marz' article about how to beat the CAP theorem with the Lambda Architecture and don't understand how immutable data will make eventual consistency less complex.
The following paragraph is taken from the article:
The key is that data is immutable. Immutable data means there's no such thing as an update, so it's impossible for different replicas of a piece of data to become inconsistent. This means there are no divergent values, vector clocks, or read-repair. From the perspective of queries, a piece of data either exists or doesn't exist. There is just data and functions on that data. There's nothing you need to do to enforce eventual consistency, and eventual consistency does not get in the way of reasoning about the system.
Imagine the following example: I have a distributed insert-only database with two nodes A and B and both hold the record [timestamp=1; id=1; value=10]. Then at the same time, there is an insert against node A which results in [timestamp=2; id=1; value=20] and a read against node B for record with id=1.
How is solving the problem of eventual consistency less complex with that example than for databases with update possibility?
I'm not 100% sure I got it right, but I'll try to explain anyway.
Consider an example: you have two databases accepting writes/reads, connected by a network link. The link goes down, resulting in a network partition. We want our system to be available in the CAP sense, so we accept writes/reads in both databases.
When working with mutable data structures: suppose a client connected to the first database wants to update the value of record X to A, and another client, connected to the second database, wants to update that value to B. Since our system is available, we accept both writes in both databases, but we will have to resolve the conflict once the network partition is gone. This will result in one of the updates being lost.
With immutable data structures, you wouldn't update the data but insert, so both writes would still be there after the network partition is gone. You'd still need some kind of time synchronization in order to preserve the operation order, which can be very tricky (see the comment in the article from Sebastien Diot).
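A toy illustration of the insert-only idea, using the [timestamp; id; value] records from the question; after the partition heals you simply union the two nodes' logs, and the "current value" is just a pure function (latest timestamp wins) over the immutable facts, so neither write is lost and no read-repair is needed:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    class InsertOnlyExample {
        record Fact(long timestamp, int id, int value) {}

        // "Query" = function over all facts: pick the fact with the latest timestamp.
        static Optional<Fact> currentValue(List<Fact> facts, int id) {
            return facts.stream()
                    .filter(f -> f.id() == id)
                    .max(Comparator.comparingLong(Fact::timestamp));
        }

        public static void main(String[] args) {
            List<Fact> nodeA = List.of(new Fact(1, 1, 10), new Fact(2, 1, 20));
            List<Fact> nodeB = List.of(new Fact(1, 1, 10));      // partitioned, missed the second insert

            List<Fact> merged = new ArrayList<>(nodeA);          // union once the partition is gone
            nodeB.stream().filter(f -> !merged.contains(f)).forEach(merged::add);

            System.out.println(currentValue(merged, 1));         // value=20, the latest fact, wins
        }
    }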
The Neo4j documentation says:
Even though all relationships have a direction they are equally well traversed in both directions so there's no need to create duplicate relationships in the opposite direction (with regard to traversal or performance).
I'm not sure how relationships are implemented in Neo4j, but if incoming and outgoing relationships are kept in separate sets, then even though they are traversed equally well, the way you design your relationships could still affect performance.
So I guess my question is: does the direction of a relationship affect the overall performance of a graph database, and if it does, how should I decide on it? E.g. does keeping the number of incoming and outgoing relationships balanced help?
Relationship directionality does not affect performance.
On disk, a node record just keeps a reference to the record for its "first" relationship (either incoming or outgoing). Traversal of relationship paths is done mainly through the relationship records. The full details are too complex to merit discussion here, but relationship data is stored symmetrically with respect to directionality. So, there is no need to worry about balancing relationship directions.
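For illustration, a small sketch against the embedded Java API (org.neo4j.graphdb); exact signatures vary somewhat between Neo4j versions, so treat it as indicative rather than definitive:

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;

    class DirectionExample {
        // The same stored relationship is reachable from either end: asking for
        // INCOMING, OUTGOING or BOTH is a filter applied at traversal time, not a
        // question of whether a "reverse" relationship exists on disk.
        static void listNeighbours(Node node) {
            for (Relationship rel : node.getRelationships(Direction.BOTH)) {
                Node neighbour = rel.getOtherNode(node);
                System.out.println(rel.getType().name() + " -> node " + neighbour.getId());
            }
        }
    }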
I'm looking for an algorithm where groups of connected nodes can be merged together to form a new group (by creating links between nodes from different groups). And where a group can be partitioned to form new partitions.
Unlike consensus-style membership protocols (e.g. the one described in the Raft paper), where only one group can remain after a partition, I'd like each new partition to form a new group.
Also, I'd like each member of each partition to agree on who belongs to that partition, with a strong consistency guarantee.
Or put differently, I'd like the following property to hold: After a group undergoes a membership change, if two nodes that belonged to the original group can still communicate (there is a path between the two), they should agree on the sequence of changes that happened to the group.
My understanding is that the fact that each new partition is going to agree on a different set of members in a sense implies that the Consistency part of the CAP theorem is relaxed, which gives me hope that such a protocol may exist(?).
No consensus protocol (such as Paxos, Raft, etc.) can be leveraged to develop a multi-group membership protocol. This is because all consensus protocols are based on the fundamental idea that a decision can be taken only if a majority of the members have agreed to it. In this way, the "split-brain" phenomenon is avoided: there can be no two partitions (each of at least majority size, (n/2)+1) that have agreed on a different leader (and thus a different member set), since at least one member would belong to both partitions and would have voted for only one of them (the one that asked for its vote first).
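To make the majority argument concrete: in a group of n members, any two quorums A and B of size at least ⌊n/2⌋ + 1 must overlap, since

    |A ∩ B| = |A| + |B| − |A ∪ B| ≥ 2(⌊n/2⌋ + 1) − n ≥ 1

so no two partitions can both reach a (conflicting) majority decision.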
One protocol that could possibly be leveraged to create a multi-group membership protocol is Virtual Synchrony. However, note that virtual synchrony is a protocol used to send messages to (statically) predefined process groups, i.e. to the currently existing members of these groups. As a result, it is not designed for cases where new process groups should be created (dynamically) at each new partition. Also note that virtual synchrony does not scale to larger groups, since message latency is proportional to the group size.
I believe that by using the virtual synchrony protocol, you could develop such a membership protocol, which could satisfy the condition
After a group undergoes a membership change, if two nodes that belonged to the original group can still communicate (there is a path between the two), they should agree on the sequence of changes that happened to the group
However, note that this membership is not strongly consistent in a strict sense, because the failure of a node is only propagated inside the group eventually. Nonetheless, messages (which are what matter most) will be delivered in a way that obeys the group's membership. This is achieved by imposing an order on message deliveries on the members' side.
Another alternative approach is gossip-based membership protocols, with real-life implementations integrated into various industry tools, such as Consul. To leverage this approach, you could emit multiple different classes of messages from each member, depending on the different groups that you would like to monitor. However, again, these groups are statically defined inside the protocol and only eventually consistent (meaning that every failure will eventually be detected by all live members).
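To give a feel for how such a protocol behaves, here is a minimal heartbeat-gossip sketch; the class and method names are illustrative, and this is not how Consul (or its underlying SWIM/Serf layer) is actually implemented:

    import java.util.HashMap;
    import java.util.Map;

    class GossipMember {
        private final String selfId;
        private final Map<String, Long> heartbeats = new HashMap<>();    // peer id -> highest heartbeat seen
        private final Map<String, Long> lastUpdated = new HashMap<>();   // peer id -> local round of last increase
        private long round = 0;

        GossipMember(String selfId) {
            this.selfId = selfId;
            heartbeats.put(selfId, 0L);
            lastUpdated.put(selfId, 0L);
        }

        // Called once per gossip round: bump our own heartbeat.
        void tick() {
            round++;
            heartbeats.merge(selfId, 1L, Long::sum);
            lastUpdated.put(selfId, round);
        }

        // Merge a randomly chosen peer's table into ours; the higher heartbeat wins.
        void mergeFrom(Map<String, Long> remote) {
            remote.forEach((id, hb) -> {
                if (hb > heartbeats.getOrDefault(id, -1L)) {
                    heartbeats.put(id, hb);
                    lastUpdated.put(id, round);
                }
            });
        }

        Map<String, Long> table() { return new HashMap<>(heartbeats); }

        // A peer whose heartbeat has not advanced for `timeout` rounds is suspected
        // failed; this is exactly the eventual (not strongly consistent) detection
        // mentioned above.
        boolean isSuspected(String id, long timeout) {
            return round - lastUpdated.getOrDefault(id, Long.MIN_VALUE) > timeout;
        }
    }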
In conclusion, I think that a strongly consistent membership protocol is not feasible in the strict sense, since you cannot distinguish between a member that has failed and a member that is responding really, really slowly (the basis of the FLP result and the CAP theorem).
To take the simplest possible example:
Start with an empty database.
Add a document
Add a design document with validation function that rejects everything
Replicate that database.
To ask a concrete question to begin with, one with an answer that I hope can be given very quickly by pointing me to the right url: is the result of this replication defined by some rule, for example that the documents are always replicated in the order they were saved, or does the successful replication of the first document depend on whether the design document happened to arrive at the destination first? In the quick experiment I did, both documents did get successfully validated, but I'm trying to find out if that outcome is defined in a spec somewhere or it's implementation dependent.
To ask a followup question that's more handwavey and may not have a single answer: what else can happen, and what sorts of solutions have emerged to manage those problems? It's obviously possible for different servers to simultaneously (and I use that word hesitantly) have different versions of a validation function. I suppose the validators could be backwards compatible, where every new version adds a case to a switch statement that looks at, say, a schema_version attribute of the document. Then if a version 2 document arrives at a server where the version 3 validator is the gatekeeper, it'll be allowed in. If a version 3 document arrives at a version 2 validator, it's a bit more tricky; it presumably depends on whether strictness or leniency is an appropriate default for the application. But can either of those things even happen? Or do the replication rules ensure that, even if servers are going up and down, updates and deletes are being done all over the place, and replication connections are intermittent and indirect, a document will never arrive on a given server before its appropriate validation function, and a validation function will never arrive too late to handle one of the documents it was supposed to check?
I could well be overcomplicating this or missing out on some Zen insight, but painful experience has taught me that I'm not clever enough to predict what sorts of states concurrent systems can get themselves into.
EDIT:
As Marcello says in a comment, updates on individual servers have sequence numbers, and replication applies the updates in sequence number order. I had a vague idea that that was the case, but I'm still fuzzy on the details. I'm trying to find the simplest possible model that will give me an idea about what can and can't happen in a complex CouchDB system.
Suppose I take the state of server A that's started off empty and has three document writes made to it. So its state can be represented as the following string:
A1,A2,A3
Suppose server B also has three writes: B1,B2,B3
We replicate A to B, so the state of B is now: B1,B2,B3,A1,A2,A3. Although presumably the A updates have taken a sequence number on entering B, so the state is now: B1, B2, B3, B4(A1), B5(A2), B6(A3).
If I understand correctly, the replicator also makes a record of the fact that everything up to A3 has been replicated to B, and it happens to store this record as part of B's internal state, but I'm wondering if this is an implementation detail that can be disregarded in the simple model.
If you operate under those rules, the A updates and the B updates would stay in order on any server they were replicated to. Perhaps the only way they could get out of order is if you did something like replicating A to B, deleting A1 on A and A2 on B, replicating A to C, then replicating B to C, leaving a state on C of: A2, A3, B1, B2, B3, B4(A1).
Is this making any sense at all? Maybe strings aren't the right way of visualising it; maybe it's better to think of, I don't know, a bunch of queues (servers) in an airport, with airport staff (replicators) moving people from queue to queue according to certain rules, and put yourself into the mind of someone trying to skip the queue, i.e. somehow get into a queue ahead of someone who's in front of them in their current queue. That has the advantage of personalising the model, but we probably don't want to replicate people in airports.
Or maybe there's some way of explaining it as a Towers of Hanoi type game, although with FIFO queues instead of LIFO stacks.
It's a model I'm hoping to find - absolutely precise as far as behavior is concerned, all irrelevant implementation details stripped away, and using whatever metaphor or imagery makes it easiest to intuit.
The basic use case is simple. CouchDB uses sequence numbers to index database changes and to determine which changes need to be replicated. Order is implicit in this algorithm, so what you fear should not happen. As a side note, the replication process only copies the last revision of a document, but this does not change anything about ordering.
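As a toy model of that bookkeeping (a simplification: CouchDB replicates per-document revisions via its changes feed and checkpoints, but the ordering idea is the same):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class SequenceReplication {
        record Update(long seq, String doc) {}

        static class Server {
            final String name;
            final List<Update> log = new ArrayList<>();              // ordered by local sequence number
            final Map<String, Long> checkpoints = new HashMap<>();   // source server -> last seq pulled
            long seq = 0;

            Server(String name) { this.name = name; }

            void write(String doc) { log.add(new Update(++seq, doc)); }

            // Pull replication: apply the source's updates in source-sequence order,
            // assigning fresh local sequence numbers, then record a checkpoint.
            void replicateFrom(Server source) {
                long since = checkpoints.getOrDefault(source.name, 0L);
                for (Update u : source.log) {
                    if (u.seq() > since) write(u.doc());
                }
                checkpoints.put(source.name, source.seq);
            }
        }

        public static void main(String[] args) {
            Server a = new Server("A");
            Server b = new Server("B");
            a.write("A1"); a.write("A2"); a.write("A3");
            b.write("B1"); b.write("B2"); b.write("B3");
            b.replicateFrom(a);
            System.out.println(b.log);   // B1..B3 followed by A1..A3, renumbered as B4..B6
        }
    }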
I've been developing an in-house DSP application (Java with hooks for Groovy/Jython/JRuby, plugins via OSGi, plenty of JNI to go around) in a data-flow/diagram style, similar to Pure Data and Simulink. My current design is a push model: the user interacts with some source component, causing it to push data onto the next component, and so on until an end block (typically a display or file writer). There are some unique challenges with this design, specifically when a component starves for input; there is no easy way to request more input. I have mitigated some of this with feedback control flow, e.g. an FFT block can broadcast to the source block of its chain that it needs more data. I've contemplated adding support for components to be push, pull or both.
I'm looking for responses regarding the merits of push vs pull vs both/hybrid. Have you done this before? What are some of the "gotchas"? How did you handle them? Is there a better solution for this problem?
Some experience with a "mostly-pull" approach in a large-scale product:
Model: Nodes build a 1:N tree, i.e. each component (except the root) has 1 parent and 1..N children. Data flows almost exclusively from parent to children. Change notifications can originate from any node in the tree.
Implementation: All leaves are notified with the sending node's id and a "generation" counter. Leaves know which node path they depend on, so they know whether they need to update. (Any other child-node update algorithm would do, too, and might have been better in hindsight.)
Leaves query their parent for current data, and the query bubbles up recursively. The generation counter is included, so the bubble-up stops at the originating node.
Advantages:
Parent nodes need little or no information about their children. Data can be consumed by anyone - this allowed a generic approach to implementing some (initially unexpected) non-UI functionality on top of data intended for display.
Child nodes can aggregate and delay updates (avoiding repaints sure beats fast painting)
Inactive leaves cause no data traffic at all.
Disadvantages:
Incremental updates are expensive, as full data is published.
The implementation actually allows different data packets to be requested (and the generation counter could prevent unnecessary data traffic), but the data packets initially designed are very large. Slicing them was an afterthought, but works OK.
You need a really good generation mechanism. The one initially implemented collided with initial updates (which need special handling - see "incremental updates") and with aggregation of updates.
The need for data travelling up the tree was greatly underestimated.
Publishing is cheap only when the node offers read-only access to current data. This might require additional update synchronization, though.
Sometimes you want intermediate nodes to update, even when all leaves are inactive.
Some leaves ended up implementing polling, and some base nodes ended up relying on that. Ugly.
Generally:
Data-Pull "feels" more native to me when the data and processing layer should know nothing about the UI. However, it requires a complex change-notification mechanism to avoid "updating the universe".
Data-Push simplifies incremental updates, but only if the sender intimately knows the receiver (a rough sketch of the two styles follows).
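A minimal sketch of the two styles, with hypothetical interface names:

    // Push: the producer drives the flow and must know where data goes.
    interface PushSink<T> {
        void accept(T sample);
    }

    class PushSource<T> {
        private final java.util.List<PushSink<T>> sinks = new java.util.ArrayList<>();
        void connect(PushSink<T> sink) { sinks.add(sink); }
        void publish(T sample) { sinks.forEach(s -> s.accept(sample)); }   // producer decides when
    }

    // Pull: the consumer drives the flow and asks upstream for exactly what it needs
    // (e.g. an FFT block requesting one more window of samples), so starvation
    // never needs a separate feedback channel.
    interface PullSource<T> {
        T next();                                   // blocks or computes on demand
    }

    class FftBlock implements PullSource<double[]> {
        private final PullSource<double[]> upstream;
        FftBlock(PullSource<double[]> upstream) { this.upstream = upstream; }
        public double[] next() {
            double[] window = upstream.next();      // consumer decides when
            return window;                          // (the actual transform is omitted here)
        }
    }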
I have no experience of similar scale using other models, so I can't really make a recommendation. Looking back, I see that I've mostly used pull, which was less of a hassle. It would be interesting to see other people's experiences.
I work on a pure-pull image processing library. It's more geared to batch-style operations where we don't have to deal with dynamic inputs and for that it seems to work very well. Pull works especially well for large data sets and for threading: we scale linearly to at least 32 CPUs (depending on the graph being evaluated, of course, heh).
We have a GUI that allows leaves to be dynamic data sources (for example, a video camera delivering frames) and they are handled by throwing away and rebuilding the relevant parts of the graph on a change. This is cheap in our case, so the overhead isn't that high.