In DDD, where should validation logic for pagination queries reside?
For example, say the service layer receives a query for a collection with parameters that look like this (in Go, though feel free to answer in any language):
// in one file
package repositories

type Page struct {
    Limit  int
    Offset int
}

// Should Page, which is part of the repository
// layer, have validation behaviour?
func (p *Page) Validate() error {
    if p.Limit > 100 {
        // ...
    }
    return nil
}

// Collection is a placeholder for whatever aggregate set is returned.
type Collection struct{ /* ... */ }

type Repository interface {
    GetCollection(p *Page) (Collection, error)
}
// in another file
package service

import "repositories"

type Service struct {
    repository repositories.Repository
}

// service layer
func (s *Service) getCollection(p *repositories.Page) (repositories.Collection, error) {
    // delegate validation to the repository layer?
    // i.e. - p.Validate()
    // or call some kind of validation logic in the service layer
    // i.e. - validatePagination(p)
    return s.repository.GetCollection(p)
}

func validatePagination(p *repositories.Page) error {
    if p.Limit > 100 {
        // ...
    }
    return nil
}
and I want to enforce a "no Limit greater than 100" rule, does this rule belong in the Service layer, or is it more of a Repository concern?
At first glance it seems like it should be enforced at the Repository layer, but on second thought, it's not necessarily an actual limitation of the repository itself. It's more of a rule driven by business constraints that belongs on the entity model. However, Page isn't really a domain entity either; it's more a property of the Repository layer.
To me, this kind of validation logic seems stuck somewhere between being a business rule and a repository concern. Where should the validation logic go?
The red flag for me is the same one identified by @plalx. Specifically:
It's more of a rule driven by business constraints that belongs on the
entity model
In all likelihood, one of two things is happening. The less likely of the two is that the business users are trying to define the technical implementation of the domain model. Every once in a while, you have a business user who knows enough about technology to try to interject these things, and they should be listened to - as a concern, not a requirement. Use cases should not define performance attributes, as those are acceptance criteria of the application itself.
That leads into the more likely scenario, in that the business user is describing pagination in terms of the user interface. Again, this is something that should be talked about. However, this is not a use case, as it applies to the domain. There is absolutely nothing wrong with limiting dataset sizes. What is important is how you limit those sizes. There is an obvious concern that too much data could be pulled back. For example, if your domain contains tens of thousands of products, you likely do not want all of those products returned.
To address this, you should also look at why you have a scenario that can return too much data in the first place. When looking at it purely from a repository's perspective, the repository is used simply as a CRUD factory. If your concern is what a developer could do with a repository, there are other ways to paginate large datasets without bleeding either a technological or application concern into the domain. If you can safely deduce that the aspect of pagination is something owned by the implementation of the application, there is absolutely nothing wrong with having the pagination code outside of the domain completely, in an application service. Let the application service perform the heavy lifting of understanding the application's requirement of pagination, interpreting those requirements, and then very specifically telling the domain what it wants.
Instead of having some sort of GetAll() method, consider having a GetById() method that takes an array of identifiers, as in the sketch below. Your application service performs a dedicated task of "searching" and determining what the application is expecting to see. The benefit may not be immediately apparent, but what do you do when you are searching through millions of records? If you want to consider using something like Lucene, Endeca, FAST, or similar, do you really need to crack open the domain for that? When, or if, you get to the point where you want to change out a technical detail and you find yourself having to actually touch your domain, to me, that is a rather large problem. When your domain starts to serve multiple applications, will all of those applications share the same application requirements?
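To make that concrete, here is a minimal sketch in the Go setting of the question. All names here - Product, ProductRepository, SearchIndex - are hypothetical illustrations, not a prescribed API:

package service

// The domain exposes a precise question instead of GetAll.
type Product struct {
    ID string
}

type ProductRepository interface {
    GetByIDs(ids []string) ([]Product, error)
}

// SearchIndex stands in for a dedicated search technology
// (Lucene, Endeca, FAST, ...) that resolves a query to identifiers only.
type SearchIndex interface {
    FindIDs(query string, limit int) ([]string, error)
}

// ProductSearch is the application service that owns "searching";
// the domain never learns about pagination or search engines.
type ProductSearch struct {
    index      SearchIndex
    repository ProductRepository
}

func (s *ProductSearch) Search(query string, limit int) ([]Product, error) {
    ids, err := s.index.FindIDs(query, limit)
    if err != nil {
        return nil, err
    }
    return s.repository.GetByIDs(ids)
}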
That last point - multiple applications sharing the domain - is the one that I find hits home the most. Several years back, I was in the same situation. Our domain had pagination inside of the repositories, because we had a business user who had enough sway and just enough technical knowledge to be dangerous. Despite the objections of the team, we were overruled (which is a discussion unto itself). Ultimately, we were forced to put pagination inside of the domain. The following year, we started to use the domain within other applications inside the business. The actual business rules never changed, but the way that we searched did - depending on the application. That left us having to bring up another set of methods to accommodate, with the promise of reconciliation in the future.
That reconciliation came with the fourth application to use the domain, which was built for an external third party to consume, when we finally conveyed the message that these continual changes in the domain could have been avoided by allowing each application to own its own requirements and by providing a means to facilitate a specific question - such as "give me these specific products". The previous approach of "give me twenty products, sorted in this fashion, with a specific offset" in no way described the domain. Each application determined what "pagination" ultimately meant to itself and how it wanted to load those results. Top result, reversing order in the middle of a paged set, etc. - those were all eliminated, because those concerns were moved nearer their actual responsibilities, and we empowered the application while still protecting the domain. We used the service layer as a delineation for what is considered "safe". Since the service layer acted as a go-between for the domain and the application, we could reject a request at the service level if, for example, the application requested more than one hundred results. This way, the application could not just do whatever it pleased, and the domain was left gleefully oblivious to the technical limitation being applied to the call being made.
"It's more of a rule driven by business constraints that belongs on
the entity model"
These kinds of rules generally aren't business rules; they are simply put in place (most likely by developers, without business experts' involvement) due to technical system limitations (e.g. to guarantee the system's stability). They usually find their natural home in the Application layer, but could be placed elsewhere if it's more practical to do so.
On the other hand, if business experts are interested in the resource/cost factor and decide to market this so that customers may pay more to view more at a time, then that would become a business rule; something the business really cares about.
In that case the rule checking would certainly not go in the Repository because the business rules would get buried in the infrastructure layer. Not only that but the Repository is very low-level and may be used in automated scripts or other processes where you would not want these limitations to apply.
Actually, I usually apply some CQRS principles and avoid going through repositories entirely for queries, but that's another story.
At first glance it seems like it should be enforced at the Repository
layer, but on second thought, it's not necessarily an actual
limitation of the repository itself. It's more of a rule driven by
business constraints that belongs on the entity model.
Actually, repositories are still domain. They're mediators between the domain and the data mapping layer. You should therefore still consider them part of the domain.
It follows that a repository interface implementation should enforce domain rules.
In summary, I would ask myself: do I want to allow non-paginated access to the data abstracted by the repository from any domain operation? The answer should probably be no, because such a domain might own thousands of domain objects, and it would be a suboptimal retrieval to get too many domain objects at once, wouldn't it?
Suggestion
* Since I don't know which language the OP is currently using, and programming language doesn't matter for this Q&A, I'll explain a possible approach using C# that the OP can translate to any programming language.
For me, enforcing a "no more than 100 results per query" rule should be a cross-cutting concern. In contrast to what @plalx has said in his answer, I really believe that something that can be expressed in code is the way to go, and it's not only an optimization concern but a rule enforced across the entire solution.
Based on what I've said above, I would design a Repository abstract class to provide some common behaviors and rules across the entire solution:
public interface IRepository<T>
{
    IList<T> List(int skip = 0, int take = 0);

    // Other method definitions like Add, Remove, GetById...
}

public abstract class Repository<T> : IRepository<T>
{
    protected virtual void EnsureValidPagination(int skip = 0, int take = 0)
    {
        if (take > 100)
        {
            throw new ArgumentException("Cannot take more than 100 objects at once", nameof(take));
        }
    }

    public IList<T> List(int skip = 0, int take = 0)
    {
        EnsureValidPagination(skip, take);

        return DoList(skip, take);
    }

    protected abstract IList<T> DoList(int skip = 0, int take = 0);

    // Other methods like Add, Remove, GetById...
}
Now EnsureValidPagination would be called in any implementation of IRepository<T> that also inherits Repository<T>, whenever you implement an operation which involves returning object collections.
If you need to enforce such a rule for some specific domain, you could just design another abstract class deriving from the one I've described above, and introduce the whole rule there.
In my case, I always implement a solution-wide repository base class and I specialize it on each domain if needed, and I use it as base class to specific domain repository implementations.
Answering a comment/concern from @guillaume31 on his answer
I agree that it isn't a domain-specific rule. But Application and
Presentation aren't domain either. Repository is probably a bit too
sweeping and low-level for me -- what if a command line data utility
wants to fetch a vast number of items and still use the same domain
and persistence layers as the other applications?
Imagine you've defined a repository interface as follows:
public interface IProductRepository
{
IList<Product> List(int skip = 0, int take = 0);
}
An interface wouldn't define a limitation on how many products I can take at once, but see the following implementation of IProductRepository:
public class ProductRepository : IProductRepository
{
    public ProductRepository(int defaultMaxListingResults = -1)
    {
        DefaultMaxListingResults = defaultMaxListingResults;
    }

    private int DefaultMaxListingResults { get; }

    private void EnsureListingArguments(int skip = 0, int take = 0)
    {
        // -1 means "no limit configured"
        if (DefaultMaxListingResults >= 0 && take > DefaultMaxListingResults)
        {
            throw new InvalidOperationException($"This repository can't take more results than {DefaultMaxListingResults} at once");
        }
    }

    public IList<Product> List(int skip = 0, int take = 0)
    {
        EnsureListingArguments(skip, take);

        // ... query the underlying store and return the requested page
        return new List<Product>();
    }
}
Who said we need to hardcode the maximum number of results that can be taken at once? If the same domain is consumed by different application layers, I don't see why you wouldn't be able to inject different constructor parameters depending on the particular requirements of those application layers.
I can see the same service layer receiving exactly the same repository implementation with different configurations, depending on the consumer of the whole domain.
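Translated back to the Go setting of the question, the same idea might look like this (a sketch; the names and the -1 convention are assumptions):

package repositories

import "fmt"

// ProductRepository carries its own configurable limit.
type ProductRepository struct {
    maxListingResults int // -1 means "no limit"
}

func NewProductRepository(maxListingResults int) *ProductRepository {
    return &ProductRepository{maxListingResults: maxListingResults}
}

func (r *ProductRepository) ensureListingArguments(skip, take int) error {
    if r.maxListingResults >= 0 && take > r.maxListingResults {
        return fmt.Errorf("cannot take more than %d results at once", r.maxListingResults)
    }
    return nil
}

// Each consumer wires its own configuration:
//   webRepo   := NewProductRepository(100) // public-facing application
//   batchRepo := NewProductRepository(-1)  // internal batch scripts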
Not a technical requirement at all
I want to throw in my two cents on a consensus reached by other answerers, which I believe is only partially right.
The consensus is that a limitation like the one required by the OP is a technical requirement rather than a business requirement.
BTW, it seems like no one has put the focus on the fact that domains can talk to each other. That is, you don't design your domain and other layers only to support the more traditional execution flow: data <-> data mapping <-> repository <-> service layer <-> application service <-> presentation (this is just a sample flow; there might be variants of it).
The domain should be bulletproof in all possible scenarios or use cases in which it will be consumed or interacted with. Hence, you should also consider the following scenario: domain interactions.
We should be less philosophical and more ready to see the real-world scenario: the whole rule can happen in two ways:
The entire project isn't allowed to take more than 100 domain objects at once.
One or more domains aren't allowed to take more than 100 domain objects at once.
Some argue that we're talking about a technical requirement, but for me it is a domain rule, because it also enforces good coding practices. Why? Because I really think that, at the end of the day, there's no chance you would want to get an entire domain object collection: pagination has many flavors, and one is infinite-scroll pagination, which can also be applied to command-line interfaces to simulate the feel of a get-all operation. So, force your entire solution to do things right and avoid get-all operations, and probably the domain itself will be implemented differently than when there's no pagination limitation.
BTW, you should consider the following strategy, sketched just below: the domain enforces that you can't retrieve more than 100 domain objects, but any other layer on top of it can also define a limit lower than 100 ("you can't get more than 50 domain objects at once, otherwise the UI would suffer performance issues"). This won't break the domain rule, because the domain won't cry if you artificially limit what you can get within the range of its rule.
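A minimal sketch of that layered-limit strategy in Go (the constants and function are hypothetical):

package pagination

import "fmt"

const domainMaxTake = 100 // the domain rule: nobody gets more than 100 at once

// EnsureTake checks the domain rule first, then an optional stricter
// application-level limit (e.g. 50 for a UI with a tight rendering budget).
func EnsureTake(take, appMaxTake int) error {
    if take > domainMaxTake {
        return fmt.Errorf("domain rule: cannot take more than %d objects at once", domainMaxTake)
    }
    if appMaxTake > 0 && take > appMaxTake {
        return fmt.Errorf("application rule: cannot take more than %d objects at once", appMaxTake)
    }
    return nil
}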
Probably in the Application layer, or even Presentation.
Choose Application if you want that rule to hold true for all front ends (web, mobile app, etc.), or Presentation if the limit has to do with how much a specific device is able to display on screen at a time.
[Edit for clarification]
Judging by the other answers and comments, we're really talking about defensive programming to protect performance.
It cannot be in the Domain layer IMO because it's a programmer-to-programmer thing, not a domain requirement. When you talk to your railway domain expert, do they bring up or care about a maximum number of trains that can be taken out of any set of trains at a time? Probably not. It's not in the Ubiquitous Language.
The Infrastructure layer (Repository implementation) is an option, but as I said, I find it inconvenient and overly restrictive to control things at such a low level. Matías's proposed implementation of a parameterized Repository is admittedly an elegant solution though, because each application can specify its own maximum. So why not - if you really want to apply a broad sweeping limit on XRepository.GetAll() to a whole applicative space.
Related
I am trying to understand which of the following two options is the right approach and why.
Say we have a GetHotelInfo(hotel_id) API that is invoked from the Web down to the Controller.
The logic of the GetHotelInfo is:
Invoke GetHotelPropertyData() (Location, facilities…)
Invoke GetHotelPrice(hotel_id, dates…)
Invoke GetHotelReviews(hotel_id)
Once all results come back, process and merge the data and return 1 object that contains all relevant data of the hotel.
Option 1:
Create 3 different repositories (HotelPropertyRepo, HotelPriceRepo,
HotelReviewsRepo)
Create GetHotelInfo usecase that will use these 3 repositories and
return the final result.
Option 2:
Create 3 different repositories (HotelPropertyRepo, HotelPriceRepo,
HotelReviewsRepo)
Create 3 different usecases (GetHotelPropertyDataUseCase,
GetHotelPriceUseCase, GetHotelReviewsUseCase)
Create GetHotelInfoUseCase that will orchestrate the previous 3
usecases. (It can also be a controller, but that’s a different topic)
Let’s say that right now only GetHotelInfo is being exposed to the Web but maybe in the future, I will expose some of the inner requests as well.
And would the answer be different if the actual logic of GetHotelInfo is not a combination of 3 endpoints but rather 10?
You can see a similar method (called Get()) in "Clean Architecture with GO" by Manato Kuroda.
Manato points out that:
following the Acyclic Dependencies Principle (ADP), the dependencies only point inward in the circle, never outward, and there is no circulation.
that the Controller and Presenter depend on the Use Case Input Port and Output Port, which are defined as interfaces, not as specific logic (the details). This is possible (without knowing the details in the outer layer) thanks to the Dependency Inversion Principle (DIP).
That is why, in example repository manakuro/golang-clean-architecture, Manato creates for the Use cases layer three directories:
repository,
presenter: in charge of Output Port
interactor: in charge of Input Port, with a set of methods of specific application business rules, depending on repository and presenter interface.
You can use that example to adapt your case, with GetHotelInfo declared first in the hotel_interactor.go file, depending on specific business methods declared in hotel_repository, and with responses defined in hotel_presenter. A condensed sketch of that shape follows.
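For illustration only - the type and method names below are simplified stand-ins for the ones in the example repository:

package interactor

// Simplified stand-in types for the hotel data.
type HotelProperty struct{ Location string }
type HotelPrice struct{ Amount int }
type HotelReview struct{ Text string }

type HotelInfo struct {
    Property HotelProperty
    Price    HotelPrice
    Reviews  []HotelReview
}

// HotelRepository abstracts the data layer (the dependency points inward).
type HotelRepository interface {
    GetHotelProperty(hotelID string) (HotelProperty, error)
    GetHotelPrice(hotelID string) (HotelPrice, error)
    GetHotelReviews(hotelID string) ([]HotelReview, error)
}

// HotelPresenter is the Output Port: it shapes the response.
type HotelPresenter interface {
    PresentHotelInfo(info HotelInfo) HotelInfo
}

type hotelInteractor struct {
    repository HotelRepository
    presenter  HotelPresenter
}

// GetHotelInfo implements the Input Port: it orchestrates the repository
// calls, merges the results, and hands the merged object to the presenter.
func (i *hotelInteractor) GetHotelInfo(hotelID string) (HotelInfo, error) {
    property, err := i.repository.GetHotelProperty(hotelID)
    if err != nil {
        return HotelInfo{}, err
    }
    price, err := i.repository.GetHotelPrice(hotelID)
    if err != nil {
        return HotelInfo{}, err
    }
    reviews, err := i.repository.GetHotelReviews(hotelID)
    if err != nil {
        return HotelInfo{}, err
    }
    info := HotelInfo{Property: property, Price: price, Reviews: reviews}
    return i.presenter.PresentHotelInfo(info), nil
}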
It is expected that Interactors (Use Case classes) call other interactors. So both approaches follow Clean Architecture principles.
But, the "maybe in the future" phrase goes against good design and architecture practices.
We can and should think in the most abstract way so that we can favor reuse. But always keep things simple and avoid unnecessary complexity.
And would the answer be different if the actual logic of GetHotelInfo is not a combination of 3 endpoints but rather 10?
No, it would be the same. However, as you are designing APIs, if you need a combination of dozens of endpoints, you should start considering putting a GraphQL layer in place instead of adding complexity to the project.
Clean is not a well-defined term. Rather, you should be aiming to minimise the impact of change (adding or removing a service). And by "impact" I mean not only the cost and time factors but also the risk of introducing a regression (breaking a different part of the system that you're not meant to be touching).
To minimise the "impact of change" you would split these into separate services/bounded contexts and allow interaction only through events. The 'controller' would raise an event (on a shared bus) like 'hotel info request', and each separate service (property, price, and reviews) would respond independently and asynchronously (maybe on the same bus), leaving the controller to aggregate the results and return them to the client, which could be done after some period of time. If you code the result aggregator appropriately it would be possible to add new 'features' or remove existing ones completely independently of the others.
To improve on this you would then separate the read and write functionality of each context into its own context, each responding to appropriate events. This will allow you to optimise and scale the write function independently of the read function. We call this CQRS.
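A minimal sketch of the aggregation idea, with a Go channel standing in for the shared bus (everything here is a hypothetical simplification of a real message bus):

package main

import (
    "fmt"
    "time"
)

type partialResult struct {
    source string
    data   string
}

func main() {
    // The shared bus, simulated with a channel.
    bus := make(chan partialResult, 3)

    // Property, price, and reviews contexts respond independently and
    // asynchronously to the "hotel info request" event.
    for _, source := range []string{"property", "price", "reviews"} {
        go func(source string) {
            bus <- partialResult{source: source, data: "..."}
        }(source)
    }

    // The controller aggregates whatever arrives within its deadline;
    // adding or removing a responding service leaves this code untouched.
    merged := map[string]string{}
    deadline := time.After(100 * time.Millisecond)
    for len(merged) < 3 {
        select {
        case r := <-bus:
            merged[r.source] = r.data
        case <-deadline:
            fmt.Println("timed out, returning partial results:", merged)
            return
        }
    }
    fmt.Println("merged hotel info:", merged)
}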
Let's say I've got a domain class with functions that are to be called in a sequence. Each function does its job, but if the previous step in the sequence is not done yet, it throws an error. The other way is that each function completes the step required for it to run and then executes its own logic. I feel that this way is not good practice, since I am adding multiple responsibilities, and the caller won't know what operations can happen when they invoke a method.
My question is: how do we handle dependent scenarios in DDD? Is it the responsibility of the caller to invoke the methods in the right sequence? Or do we make the methods handle the dependent operations before their own logic?
Is it the responsibility of the caller to invoke the methods in the right sequence?
It's OK if those methods have a business meaning. For example, the client may book a flight, and then book a hotel room. Both of those are things the client understands, and it is the client's logic to call them in this sequence. On the other hand, inserting the reservation into the database, then committing (or whatever) is technical. The client should not have to deal with that at all. Nor with "initializing" an object, then calling other methods, then calling "close".
Requiring a sequence of technical calls is a form of temporal coupling, it is considered a bad practice, and is not directly related to DDD.
The solution is to model the problem better. There is probably a higher level use-case the caller wants achieved with this call sequence. So instead of publishing the individual "steps" required, just support the higher use-case as a whole.
In general you should always design with the goal that any sequence of valid calls actually means something (as far as the language allows).
Update: A possible model for the mentioned "File" domain:
public interface LocalFile {
    RemoteFile upload();
}

public interface RemoteFile {
    RemoteFile convert(...);
    LocalFile download();
}
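A rough Go translation of the same model, showing how the types only ever offer the legal next step, so an out-of-order sequence cannot even compile (names and bodies are illustrative):

package files

// A LocalFile can only be uploaded; converting and downloading
// are only available once a RemoteFile exists.
type LocalFile struct{ path string }

type RemoteFile struct{ url string }

func (f LocalFile) Upload() (RemoteFile, error) {
    // ... push the bytes to remote storage
    return RemoteFile{url: "remote://" + f.path}, nil
}

func (f RemoteFile) Convert(format string) (RemoteFile, error) {
    // ... ask the remote service for a converted copy
    return f, nil
}

func (f RemoteFile) Download() (LocalFile, error) {
    // ... fetch the bytes back to disk
    return LocalFile{path: "local-copy"}, nil
}

// Usage reads as the business sequence itself:
//   remote, _ := local.Upload()
//   converted, _ := remote.Convert("pdf")
//   result, _ := converted.Download()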
From my point of view, what you are describing is the orchestration of domain model operations. That's the job of the application layer, the layer above the domain model. You should have an application service that calls the domain model methods in the right sequence, and it should also take into account whether some step has left any task undone and, in such a case, tell the next step to perform it.
TLDR; Scroll to the bottom for the answer, but the backstory will give some good context.
If the caller into your domain must know the order in which to call things, then you have missed an opportunity to encapsulate business logic in your domain, which is a symptom of an anemic domain.
@RobertBräutigam made a very good point:
Requiring a sequence of technical calls is a form of temporal coupling, it is considered a bad practice, and is not directly related to DDD.
This is true, but it is worse when you do it with your domain model, because non-domain concerns get intermixed with domain concerns. Intent becomes lost in a sea of non-business logic. If you can, look for a higher-order aggregate that encapsulates the ordering. To borrow Robert's example, rather than booking a flight then a hotel room and forcing that order on the client, you could have a Vacation aggregate take both and validate it.
I know that sounds wrong in your case, and I suspect you're right. There's a clear dependency that can't happen all at once, so that can't be the end of the story. When you have a clear dependency with intermediate transactions that must occur before the "final" state, we have... orchestration (think sagas, distributed transactions, domain events and all that goodness).
What you describe with file operations spans across transactions. The manipulation (state change) of a domain is transactional at each point in a distributed transaction, but is not transactional overall. So when @choquero70 says
you are describing is the orchestration of domain model operations. That's the job of the application layer, the layer upon domain model.
that's also correct. Orchestration is key. Each step must manipulate the state of the domain once, and once only, and leave it in a valid state, but it is OK for there to be multiple steps.
Each of those individual points along the timeline are valid moments in the state of your domain.
So, back to your model. If you expose a single interface with multiple possible calls to all steps, then you leave yourself open to things being called out of order. Make this impossible, or at least improbable. Orchestration is not just about what to do, but about what to prevent from happening. Create smaller interfaces/classes to avoid accidentally increasing the "surface area" of what could be misused.
In this way, you are guiding the caller on what to do next by feeding them valid intermediate states. But, and this is the important part, the burden of knowing what to call in what order is not on the caller. Sure, the caller could know what to do, but why force it?
Your basic algorithm is the same: upload, transform, download.
Is it the responsibility of the caller to invoke the methods in the right sequence?
Not exactly. It is the responsibility of the caller to choose from the legitimate choices given the state of your domain. It's "your" responsibility to present those choices via business methods on your correctly modeled moment/interval aggregate, suitable for the caller to use.
Or do we make the methods handle the dependent operations before it's own logic?
If you've setup orchestration correctly, this won't be necessary. But it does make sense to validate anyway.
On a side note, each step of the orchestration you do should be very linear in nature. I tell my developers to be suspicious of an orchestration step that has an if statement in it. If there's an if, that logic is likely better off as part of another orchestration step or encapsulated in business logic.
Take a GraphQL schema like this:
type User {
  id: ID!
  location: Location
}

type Location {
  id: ID!
  user: User
}
Now, the client sends a GraphQL query. Theoretically, User and Location can reference each other circularly, to infinite depth.
I think it's an anti-pattern. As far as I know, there is no middleware or built-in way to limit the nesting depth of a query in either the graphql or apollo communities.
A query with infinite nesting depth will cost a lot of resources for my system: bandwidth, hardware, and performance, not only server-side but also client-side.
So, if a GraphQL schema allows circular references, there should be some middleware or way to limit the nesting depth of a query, or some constraints added to the query.
Maybe not allowing circular references is a better idea?
I'd prefer sending a separate query and doing multiple operations in one query. It's much simpler.
Update
I found this library: https://github.com/slicknode/graphql-query-complexity. If GraphQL doesn't limit circular references, this library can protect your application against resource exhaustion and DoS attacks.
It depends.
It's useful to remember that the same solution can be a good pattern in some contexts and an antipattern in others. The value of a solution depends on the context that you use it. — Martin Fowler
It's a valid point that circular references can introduce additional challenges. As you point out, they are a potential security risk in that they enable a malicious user to craft potentially very expensive queries. In my experience, they also make it easier for client teams to inadvertently overfetch data.
On the other hand, circular references allow an added level of flexibility. Running with your example, if we assume the following schema:
type Query {
  user(id: ID): User
  location(id: ID): Location
}

type User {
  id: ID!
  location: Location
}

type Location {
  id: ID!
  user: User
}
it's clear we could potentially make two different queries to fetch effectively the same data:
{
  # query 1
  user(id: ID) {
    id
    location {
      id
    }
  }

  # query 2
  location(id: ID) {
    id
    user {
      id
    }
  }
}
If the primary consumers of your API are one or more client teams working on the same project, this might not matter much. Your front end needs the data it fetches to be of a particular shape and you can design your schema around those needs. If the client always fetches the user, can get the location that way and doesn't need location information outside that context, it might make sense to only have a user query and omit the user field from the Location type. Even if you need a location query, it might still not make sense to expose a user field on it, depending on your client's needs.
On the flip side, imagine your API is consumed by a larger number of clients. Maybe you support multiple platforms, or multiple apps that do different things but share the same API for accessing your data layer. Or maybe you're exposing a public API designed to let third-party apps integrate with your service or product. In these scenarios, your idea of what a client needs is much blurrier. Suddenly, it's more important to expose a wide variety of ways to query the underlying data to satisfy the needs of both current clients and future ones. The same could be said for an API for a single client whose needs are likely to evolve over time.
It's always possible to "flatten" your schema as you suggest and provide additional queries as opposed to implementing relational fields. However, whether doing so is "simpler" for the client depends on the client. The best approach may be to enable each client to choose the data structure that fits their needs.
As with most architectural decisions, there's a trade-off and the right solution for you may not be the same as for another team.
If you do have circular references, all hope is not lost. Some implementations have built-in controls for limiting query depth. GraphQL.js does not, but there are libraries out there like graphql-depth-limit that do just that. It's worth pointing out that breadth can be just as large a problem as depth -- regardless of whether you have circular references, you should look into implementing pagination with a max limit when resolving Lists as well, to prevent clients from potentially requesting thousands of records at a time.
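To make the depth check concrete, here is a small, library-free Go sketch of the idea behind tools like graphql-depth-limit: walk the parsed query and reject it once the nesting exceeds a threshold (Field is a simplified stand-in for a real AST node, not any library's actual type):

package main

import "fmt"

// Field is a simplified stand-in for a node in a parsed GraphQL query.
type Field struct {
    Name       string
    Selections []Field
}

// depth returns the nesting depth of a field and its selections.
func depth(f Field) int {
    max := 0
    for _, child := range f.Selections {
        if d := depth(child); d > max {
            max = d
        }
    }
    return max + 1
}

func checkDepth(query Field, limit int) error {
    if d := depth(query); d > limit {
        return fmt.Errorf("query depth %d exceeds limit %d", d, limit)
    }
    return nil
}

func main() {
    // user { location { user { id } } } has depth 4.
    q := Field{Name: "user", Selections: []Field{
        {Name: "location", Selections: []Field{
            {Name: "user", Selections: []Field{{Name: "id"}}},
        }},
    }}
    fmt.Println(checkDepth(q, 3)) // query depth 4 exceeds limit 3
}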
As @DavidMaze points out, in addition to limiting the depth of client queries, you can also use dataloader to mitigate the cost of repeatedly fetching the same record from your data layer. While dataloader is typically used to batch requests to get around the "n+1 problem" that arises from lazily loading associations, it can also help here. In addition to batching, dataloader caches the loaded records. That means subsequent loads of the same record (inside the same request) don't hit the db but are fetched from memory instead.
TLDR; Circular references are an anti-pattern for non-rate-limited GraphQL APIs. APIs with rate limiting can safely use them.
Long Answer: Yes, true circular references are an anti-pattern on smaller/simpler APIs ... but when you get to the point of rate-limiting your API you can use that limiting to "kill two birds with one stone".
A perfect example of this was given in one of the other answers: GitHub's GraphQL API lets you request a repository, with its owner, with their repositories, with their owners... infinitely... or so you might think from the schema.
If you look at the API though (https://developer.github.com/v4/object/user/) you'll see their structure isn't directly circular: there are types in-between. For instance, User doesn't reference Repository, it references RepositoryConnection. Now, RepositoryConnection does have a RepositoryEdge, which does have a nodes property of type [Repository] ...
... but when you look at the implementation of the API (https://developer.github.com/v4/guides/resource-limitations/) you'll see that the resolvers behind the types are rate-limited (i.e. no more than X nodes per query). This guards both against consumers who request too much (breadth-based issues) and consumers who request infinitely (depth-based issues).
Whenever a user requests a resource, GitHub can allow circular references because it puts the burden of not actually making queries circular onto the consumer. If the consumer fails to do so, the query fails because of the rate-limiting.
This lets responsible users ask for the user, of the repository, owned by the same user ... if they really need that ... as long as they don't keep asking for the repositories owned by the owner of that repository, owned by ...
Thus, GraphQL APIs have two options:
avoid circular references (I think this is the default "best practice")
allow circular references, but limit the total nodes that can be queried per call, so that infinite circles aren't possible
If you don't want to rate-limit, GraphQL's approach of using different types can still give you a clue to a solution.
Let's say you have users and repositories: you need two types for both, a User and UserLink (or UserEdge, UserConnection, UserSummary ... take your pick), and a Repository and RepositoryLink.
Whenever someone requests a user via a root query, you return the User type. But that User type would not have:
repositories: [Repository]
it would have:
repositories: [RepositoryLink]
RepositoryLink would have the same "flat" fields as Repository, but none of its potentially circular object fields. Instead of owner: User, it would have owner: ID.
The pattern you show is fairly natural for a "graph" and I don't think it's especially discouraged in GraphQL. The GitHub GraphQL API is the thing I often look at when I wonder "how do people build larger GraphQL APIs", and there are routinely object cycles there: a Repository has a RepositoryOwner, which can be a User, which has a list of repositories.
At least graphql-ruby has a control to limit nesting depth. Apollo doesn't obviously have this control, but you might be able to build a custom data source or use the DataLoader library to avoid repeatedly fetching objects you already have.
The above answers provide good theoretical discussion on the question. I would like to add more practical considerations that occur in software development.
As @daniel-rearden points out, a consequence of circular references is that they allow multiple query documents to retrieve the same data. In my experience, this is a bad practice because it makes client-side caching of GraphQL requests less predictable and more difficult, since a developer would have to explicitly specify that the documents are returning the same data in a different structure.
Furthermore, in unit testing, it is difficult to generate mock data for objects whose fields/properties contain circular references to the parent (at least in JS/TS; if there are languages that support this easily out of the box, I'd love to hear about it in a comment).
Maintenance of a clear data hierarchy seems to be the clear choice for understandable and maintainable schemas. If a reference to a field's parent is frequently needed, it is perhaps best to build a separate query.
Aside: Truthfully, if it were not for the practical consequences of circular references, I would love to use them. It would be beautiful and amazing to represent data structures as a "mathematically perfect" directed graph.
In my limited experience, I've been told repeatedly that you should not pass entities around to the front end or via REST, but should instead use a DTO.
Doesn't Spring Data REST do exactly this? I've looked briefly into projections, but those seem to just limit the data that is being returned, and it still expects an entity as a parameter to a POST method to save to the database. Am I missing something here, or am I (and my coworkers) incorrect in that you should never pass around an entity?
tl;dr
No. DTOs are just one means to decouple the server side domain model from the representation exposed in HTTP resources. You can also use other means of decoupling, which is what Spring Data REST does.
Details
Yes, Spring Data REST inspects the domain model you have on the server side to reason about how the representations for the resources it exposes will look. However, it applies a couple of crucial concepts that mitigate the problems a naive exposure of domain objects would bring.
Spring Data REST looks for aggregates and by default shapes the representations accordingly.
The fundamental problem with the naive "I throw my domain objects in front of Jackson" approach is that, from the plain entity model, it's very hard to reason about reasonable representation boundaries. Entity models derived from database tables especially have the habit of connecting virtually everything to everything. This stems from the fact that important domain concepts like aggregates are simply not present in most persistence technologies (read: especially in relational databases).
However, I'd argue that in this case "Don't expose your domain model" is acting more on the symptoms than on the core of the problem. If you design your domain model properly, there's a huge overlap between what's beneficial in the domain model and what a good representation looks like to effectively drive that model through state changes. A couple of simple rules:
For every relationship to another entity, ask yourself: couldn't this rather be an ID reference? By using an object reference you pull a lot of the semantics of the other side of the relationship into your entity. Getting this wrong usually leads to entities referring to entities referring to entities, which is a problem on a deeper level. On the representation level this allows you to cut off data, cater for consistency scopes, etc.
Avoid bi-directional relationships as they're notoriously hard to get right on the update side of things.
Spring Data REST does quite a few things to actually transfer those entity relationships into the proper mechanisms on the HTTP level: links in general and more importantly links to dedicated resources managing those relationships. It does so by inspecting the repositories declared for entities and basically replaces an otherwise necessary inlining of the related entity with a link to an association resource that allows you to manage that relationship explicitly.
That approach usually plays nicely with the consistency guarantees described by DDD aggregates on the HTTP level. PUT requests don't span multiple aggregates by default, which is a good thing as it implies a scope of consistency of the resource matching the concepts of your domain.
There's no point in forcing users into DTOs if that DTO just duplicates the fields of the domain object.
You can introduce as many DTOs for your domain objects as you like. In most cases, the fields captured in the domain object will be reflected in the representation in some way. I have yet to see an entity Customer containing firstname, lastname and emailAddress properties where those were completely irrelevant in the representation.
The introduction of DTOs doesn't guarantee decoupling by any means. I've seen way too many projects where they were introduced for cargo-culting reasons, simply duplicated all fields of the entity backing them, and thereby just caused additional effort because every new field had to be added to the DTOs as well. But hey, decoupling! Not. ¯\_(ツ)_/¯
That said, there are of course situations where you'd want to slightly tweak the representation of those properties, especially if you use strongly typed value objects for e.g. an EmailAddress (good!) but still want to render this as a plain String in JSON. But by no means is that a problem: Spring Data REST uses Jackson under the covers which offers you a wide variety of means to tweak the representation — annotations, mixins to keep the annotations outside your domain types, custom serializers etc. So there is a mapping layer in between.
Not using DTOs by default is not a bad thing per se. Just imagine the outcry from users about the amount of boilerplate necessary if we required DTOs to be written for everything! A DTO is just one means to an end. If that end can be achieved in a different way (and it usually can), why insist on DTOs?
Just don't use Spring Data REST where it doesn't fit your requirements.
Continuing on the customization efforts, it's worth noting that Spring Data REST exists to cover exactly the parts of the API that just follow the basic REST API implementation patterns it implements. And that functionality is in place to give you more time to think about:
How to shape your domain model
Which parts of your API are better expressed through hypermedia driven interactions.
Here's a slide from the talk I gave at SpringOne Platform 2016 that summarizes the situation.
The complete slide deck can be found here. There's also a recording of the talk available on InfoQ.
Spring Data REST exists for you to be able to focus on the underlined circles. By no means do we think you can build a really great API solely by switching Spring Data REST on. We just want to reduce the amount of boilerplate so you have more time to think about the interesting bits.
Just like Spring Data in general reduces the amount of boilerplate code to be written for standard persistence operations. Nobody would argue you can actually build a real-world app from only CRUD operations. But by taking the effort out of the boring bits, we allow you to think more intensively about the real domain challenges (and you should actually do that :)).
You can be very selective in overriding certain resources to completely take control of their behavior, including manually mapping the domain types to DTOs if you want. You can also place custom functionality next to what Spring Data REST provides and just hook the two together. Be selective about what you use.
A sample
You can find a slightly advanced example of what I described in Spring RESTBucks, a Spring (Data REST) based implementation of the RESTBucks example in the RESTful Web Services book. It uses Spring Data REST to manage Order instances but tweaks its handling to introduce custom requirements and completely implement the payment part of the story manually.
Spring Data REST enables a very fast way to prototype and create a REST API based on a database structure. We're talking about minutes vs days, when comparing with other programming technologies.
The price you pay for that, is that your REST API is tightly coupled to your database structure. Sometimes, that's a big problem. Sometimes it's not. It depends basically on the quality of your database design and your ability to change it to suit the API user needs.
In short, I consider Spring Data REST as a tool that can save you a lot of time under certain special circumstances. Not as a silver bullet that can be applied to any problem.
We used to use DTOs with the full traditional layering (Database, DTO, Repository, Service, Controllers, ...) for every entity in our projects, hoping the DTOs would some day save our lives :)
So for a simple City entity which has id, name, country, and state, we did as below:
City table with id, name, country, ... columns
CityDTO with id, name, country, ... properties (exactly the same as the database)
CityRepository with a findCity(id), ...
CityService with findCity(id) { CityRepository.findCity(id) }
CityController with findCity(id) { ConvertToJson(CityService.findCity(id)) }
Too much boilerplate code just to expose city information to the client. As this is a simple entity, no business logic is done at all along these layers; the object just passes through.
A change to the City entity started at the database and rippled through all layers (for example, adding a location property, because in the end the location property should be exposed to the user as JSON). Adding a findByNameAndCountryAllIgnoringCase method required changing all layers (each layer needs to have the new method).
With Spring Data REST (of course together with Spring Data), this is beyond simple:
public interface CityRepository extends CrudRepository<City, Long> {
    City findByNameAndCountryAllIgnoringCase(String name, String country);
}
The city entity is exposed to the client with minimal code, and you still have control over how the city is exposed. Validation, security, object mapping... it's all there, so you can tweak everything.
For example, if I want to keep the client unaware of a city entity property name change (layer separation), I can use the custom object mapping mentioned at https://docs.spring.io/spring-data/rest/docs/3.0.2.RELEASE/reference/html/#customizing-sdr.custom-jackson-deserialization
To summarize
We use Spring Data REST as much as possible; in complicated use cases we can still go for traditional layering and let the Service and Controller do some business logic.
A client/server release is going to publish at least two artifacts. This already decouples client from server. When the server's API is changed, applications do not immediately change. Even if the applications are consuming the JSON directly, they continue to consume the legacy API.
So, the decoupling is already there. The important thing is to think about the various ways a server's API is likely to evolve after it is released.
I primarily work with projects which use DTOs and numerous rigid layers of boilerplate between the server's SQL and the consuming application. Rigid coupling is just as likely in these applications. Often, changing anything in the DB schema requires us to implement a new set of endpoints. Then, we support both sets of endpoints along with the accompanying boilerplate in each layer (Client, DTO, POJO, DTO <-> POJO conversions, Controller, Service, Repository, DAO, JDBC <-> POJO conversion, and SQL).
I'll admit that there is a cost to dynamic code (like spring-data-rest) when doing anything not supported by the framework. For example, our servers need to support a lot of batch insert/update operations. If we only need that custom behavior in a single case, it's certainly easier to implement it without spring-data-rest. In fact, it may be too easy. Those single cases tend to multiply. As the number of DTOs and accompanying code grows, the inconsistencies eventually become extremely burdensome to maintain. In some non-dynamic server implementations, we have hundreds of DTOs and POJOs that are likely no longer used by anything. But, we are forced to continue supporting them as their number grows each month.
With spring-data-rest, we pay the cost of customization early. With our multi-layer hard-coded implementations, we pay it later. Which one is preferred depends on a lot of factors (including the team's knowledge and the expected lifetime of the project). Both types of project can collapse under their own weight. But, over time, I've become more comfortable with implementations (like spring-data-rest without DTOs) that are more dynamic. This is especially true when the project lacks good specifications. Over time, such a project can easily drown in the inconsistencies buried within its sea of boilerplate.
From the Spring documentation, I don't see that Spring Data REST exposes entities; you are the one doing it.
Spring Data projects intend to ease the process of accessing different data sources, but you are the one deciding which layer to expose via Spring Data REST.
Reorganizing your project will help to solve your issue.
Every @Repository that you create with Spring Data represents more of a DAO, in the design sense, than a Repository. Each one is tightly coupled to the particular data source you want to reach out to, say JPA, Mongo, Redis, Cassandra, ...
Those layers are meant to return entity representations or projections.
However, if you look at the Repository pattern from a design perspective, you should have a higher layer of abstraction over those specific DAOs, where your app uses those DAOs to get info from as many different sources as it needs and builds business-specific objects for your app (those might look more like your DTOs).
That is probably the layer you want to expose on your Spring Data Rest.
NOTE: I see an answer recommending returning Entity instances only because they have the same properties as the DTO. This is normally a bad practice, and in particular is a bad idea in Spring and many other frameworks, because they do not return your actual classes; they return proxy wrappers so that they can work some magic like lazy loading of values and the like.
Hopefully you'll see the problem I'm describing in the scenario below. If it's not clear, please let me know.
You've got an application that's broken into three layers,
front end UI layer, could be an ASP.NET WebForm or a Windows form (used for editing Person data)
middle tier business service layer, compiled into a dll (PersonServices)
data access layer, compiled into a dll (PersonRepository)
In my front end, I want to create a new Person object, set some properties, such as FirstName, LastName according to what has been entered in the UI by a user, and call PersonServices.AddPerson, passing the newly created Person. (AddPerson doesn't have to be static, this is just for simplicity, in any case the AddPerson will eventually call the Repository's AddPerson, which will then persist the data.)
Now here's the part I'd like to hear your opinion on: validation. Somewhere along the line, that newly created Person needs to be validated. You could do it on the client side, which would be simple, but what if I wanted to validate the Person in my PersonServices.AddPerson method? This would ensure any person I want to save is validated, and removes any dependency on the UI layer doing the work. Or maybe validate both in the UI and in the business service layer. Sounds good so far, right?
So, for simplicity, I'll update the PersonService.AddPerson method to perform the following validation checks
- Check if FirstName and LastName are not empty
- Ensure this new Person doesn't already exist in my repository
And this method will return True if all validation passes and the Person is persisted, False if Validation fails or if the Person is not persisted.
But this Boolean value that AddPerson returns isn't enough for me at the UI layer to give the user a clear reason why the save process failed. So what's a lonely developer to do? Ultimately, I'd like the AddPerson method to be able to ensure that what it's about to save is valid, and if not, be able to communicate the reasons why it's not valid to my UI layer.
Just to get your juices flowing, here are some ways of solving this (some of these solutions, in my opinion, suck, but I'm just putting them there so you get an understanding of what I'm trying to solve):
Instead of AddPerson returning a boolean, it can return an int (i.e. 0 = success, non-zero = failure, with the number indicating the reason it failed).
In AddPerson, throw custom exceptions when validation fails. Each type of custom exception would have its own error message. In addition, each custom exception would be unique enough to catch in the UI layer
Have AddPerson return some sort of custom class that would have properties indicating whether validation passed or failed, and if it did fail, the reasons why
Not sure if this can be done in VB or C#, but attach some sort of property to the Person and its underlying properties. This "attached" property could contain things like validation info
Insert your idea or pattern here
And maybe another here
Apologies for the long-winded question, but I'd definitely like to hear your opinion on this.
Thanks!
Multiple layers of validation go well with multi-layer apps.
The UI itself can do the simplest and quickest checks (are all mandatory fields present, are they using the appropriate character sets, etc) to give immediate feedback when the user makes a typo.
However, the business logic should have the lion's share of validation responsibilities... and for once it's not a problem if this is "repetitious", i.e., if the business layer re-checks something that should already have been checked in the UI. The BL should check all the business rules (this double-checks the UI's correctness; enables multiple different UI clients that may not all be perfect in their checks -- e.g. a special client on a smart phone which may not have good javascript, and so on -- and, a bit, wards against maliciously hacked clients).
When the business logic saves the "validated" data to the DB, that layer should perform its own checks -- DBs are good at that, and, again, don't worry about some repetition -- it's the DB's job to enforce data integrity (you might want different ways to feed data to it one day, e.g. a "bulk loader" to import a number of Persons from another source, and it's key to ensure that all those ways to load data always respect data integrity rules); some rules such as uniqueness and referential integrity are really best enforced in the DB, in particular, for performance reasons too.
When the DB returns an error message (data not inserted as constraint X would be violated) to the business layer, the latter's job is to reinterpret that error in business terms and feed the results to the UI to inform the user; and of course the BL must similarly provide clear and complete info on business rules violation to the UI, again for display to the user.
A "custom object" is thus clearly "the only way to go" (in some scenarios I'd just make that a JSON object, for example). Keeping the Person object around (to maintain its "validation problems" property) when the DB refused to persist it does not look like a sharp and simple technique, so I don't think much of that option; but if you need it (e.g. to enable "tell me again what was wrong" functionality, maybe if the client went away before the response was ready and needs to smoothly restart later; or, a list of such objects for later auditing, &c), then the "custom validation-failure object" could also be appended to that list... but that's a "secondary issue", the main thing is for the BL to respond to the UI with such an object (which could also be used to provide useful non-error info if the insertion did in fact succeed).
Just a quick (and hopefully helpful) comment: when you're wondering where to place validation, try pretending that, soon, you're going to completely recreate your UI layer using a technology you're not yet so familiar with**. Try to keep out of that layer any validation-like business logic that you know for certain you'd have to rewrite in the new technology.
You'll find exceptions - business logic that ends up in your UI layer regardless - but it's a useful consideration nonetheless.
** Mobile dev, Silverlight, Voice XML, whatever - pretending you don't know the technology of your "new" UI layer helps you abstract your concerns and get less mired in implementation details.
The only important points are:
From the perspective of the front-end(s), the Middle Tier must perform all validation. You never know whether someone is going to try circumventing your front-end validation by talking directly to your Middle Tier (for whatever reason)
The Middle Tier may elect to delegate some of that validation to the DB layer (e.g. data integrity constraints)
You may optionally duplicate some validation in the UI, but that should only be for the sake of performance (to avoid round-trips to the Middle Tier for common scenarios, such as missing mandatory fields, incorrectly formatted data, etc.) These checks should never take the place of doing them in the Middle Tier
Validation should be done at all three levels.
When I am in a project, I assume I am making a framework, which most of the time is not the case. Each layer is separate and must check all of its input before doing an operation.
Each level can have a different way of doing it; it is not necessary that they all use the same approach, but ideally they should all use the same validation, with the ability to customize it.
You never want to let bad data into the database. So you can never trust the data you are getting from the business layer. It needs to be checked.
In the business layer you can never trust the UI layer, and you must check its input to prevent unneeded calls to the database layer. The UI layer works the same way.
I disagree with David Basarab's comment that the same validations should be present in all layers. This defies the paradigm of layer responsibility, for one reason. Secondly, though the main intention is to make the layers (or components) loosely coupled, it is also important that a level of responsibility (and hence trust) is endowed on the layers. Though it might be necessary to duplicate some validations in the UI and Business layers (since the UI layer can be bypassed by hacking attempts), it is not advisable to repeat the validations in each layer. Each layer should perform only those validations which it is responsible for. The biggest flaw in repeating validations in all layers is code redundancy, which can cause a maintenance nightmare.
A lot of this is more style than substance. I personally favor returning status objects as a flexible and extensible solution. I would say that there are a couple of classes of validation in play: the first being "does this person data conform to the contract of what a person is?" and the second being "does this person data violate constraints in the database?" I think the first validation can, and should, be done at the client. The second should be done at the middle tier. With this division, you may find that the only reasons the save could fail are 1) it violates a uniqueness constraint, or 2) something catastrophic. You could then return false for the first case, and throw an exception for the other.
If tier R is closer to the user (or any input stream you don't control) than tier S then tier S should validate all data received from tier R. This does not mean that tier R shouldn't validate data. It's better for the user if the GUI warns him he's making a mistake before he attempts a new transaction. But no matter how bulletproof the validation in your GUI is, the next tier up should not trust that any validation has taken place.
This assumes your database is completely under your control. If not, you have bigger problems.
Also, you could have the UI pass the data needed to build a Person object through some sort of PersonBuilder object, so that object creation is consolidated in the domain/business layer, and you can keep the Person object in a state that is always consistent. This makes more sense for more complex entities; however, even for simple ones, it is good to centralize object creation, just like you centralize persistence, etc. A sketch of the idea follows.
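A brief Go sketch of that builder idea (names hypothetical):

package domain

import "errors"

// Construction is centralized here, so a Person can never be
// observed in an invalid state.
type Person struct {
    firstName, lastName string
}

type PersonBuilder struct {
    firstName, lastName string
}

func (b *PersonBuilder) FirstName(v string) *PersonBuilder { b.firstName = v; return b }
func (b *PersonBuilder) LastName(v string) *PersonBuilder  { b.lastName = v; return b }

// Build is the single place where the invariants are checked.
func (b *PersonBuilder) Build() (*Person, error) {
    if b.firstName == "" || b.lastName == "" {
        return nil, errors.New("first and last name are required")
    }
    return &Person{firstName: b.firstName, lastName: b.lastName}, nil
}

// Usage:
//   p, err := new(PersonBuilder).FirstName("Ada").LastName("Lovelace").Build()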