For a multi-tenancy architecture for a web application using a document-oriented database I can see two conceivable options:
Having one database per tenant, and the collections logically separate different kinds of object.
Having one collection per tenant, and all user data is stored in one database, with some kind of flag or object type identifier on each record.
Have there been any studies or has any documentation been produced regarding these two options and the differences between them?
Is there a particular standard or good reason why someone designing a web application which allows multiple users to store vastly different kinds of data would choose one over the other?
Aside from speed/efficiency issues, are there any other things to be said about this that would influence the decision?
EDIT: I'm aware some of the terminology might be database-specific, so for anyone wondering, I am specifically referring to MongoDB.
I wouldn't want tenant-specific collections. In my application, I usually hard-code collection names, in the same way as I'd hard-code table names if I were using SQL tables. There'd be one comments collection that stores all comments for a blog. I would not want to deal with collection names like comments_tenant_1 and comments_tenant_2, because 1) that feels error-prone, 2) it would make the application code more complicated (collection names would have to be replaced with functions that compute the collection name), and 3) the number of collections in a single database could grow huge, which would make a list of all collections look daunting. Also, MongoDB isn't built for having very many collections (see the link in the comment below your question, which David B posted, https://docs.mongohq.com/use-cases/multi-tenant.html).
However, database names aren't coupled to application data structures, and you can grant permissions on databases (but not on single collections). So one database per tenant could be reasonable. As could be a per document tenant_id field in a single database for all tenants (see the above-mentioned link).
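As a rough illustration of the two options that remain reasonable here, this is how application code might look in each case. It is plain Python with no driver calls. The helper names (tenant_database_name, scoped_filter) and the "app_" prefix are made up for this sketch; only the tenant_id field name comes from the answer above.

```python
import re
from typing import Optional


def tenant_database_name(tenant_id: str, prefix: str = "app_") -> str:
    """Option A: one database per tenant.

    Collection names stay hard-coded; only the database name is derived
    from the tenant.
    """
    # Accept a conservative character set only, so a tenant identifier can
    # never smuggle in characters that are invalid in a database name.
    if not re.fullmatch(r"[A-Za-z0-9_]{1,32}", tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"{prefix}{tenant_id}"


def scoped_filter(tenant_id: str, query: Optional[dict] = None) -> dict:
    """Option B: one shared database, every document carries tenant_id."""
    merged = dict(query or {})
    # Set last, so a caller-supplied tenant_id can never widen the scope.
    merged["tenant_id"] = tenant_id
    return merged
```

With a driver such as PyMongo, the first helper would typically be used as client[tenant_database_name(t)]["comments"], and the second as db["comments"].find(scoped_filter(t, {"post_id": 7})). In both cases the collection name "comments" stays a constant in the code.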
Related
Currently I'm developing a user-search application where users can do a full-text search. It should be extremely fast and there can be a lot of users, around 100,000. There are also around 10,000 user groups. I came across Solr and started to implement this, but it seems like I'm failing at the design level.
The requirements:
There is a default schema which is applied to all user groups
Each user is assigned to exactly one user group
A user group can have additional fields (besides the default schema) which should be displayed in the result set (so they can extend the data with custom data)
The search should be extremely fast
How would you build an application that meets these requirements?
First, I thought about creating a "master core" for the default schema and a core for each user group, so that I could join the necessary cores when a user requests the data. But it seems that joining cores in standalone mode would not work because it does not support sharding. Even if it did work, I'm concerned about performance because of joining at query time.
SolrCloud does seem to support sharding, but again, I would need to join the queries into one result set, which would impact performance again. Additionally, I came across the post "Query multiple collections with different fields in solr", which says that I would need a merged schema (share-unification) to be able to query across collections/shards. This would mean that whenever a user group's schema is changed, I would need to change my share-unification. As all user groups' schemas rely on the share-unification, the search would be unavailable because I would need to re-index at least two schemas.
A simple solution would be to put everything into a single core (standalone) or collection (cloud), but this feels overwhelming.
Has anyone done something similar before who can give good advice or even a best practice?
I need to know what is the recommended solution when I want to index my solr data using multiple queries and entities.
I ask because I have to add new fields to the schema.xml configuration, and depending on the entity (query) there should be different field definitions.
query_one = "select * from car"
query_two = "select * from user"
The car and user tables have different fields, so I need to take this into account in my schema.xml config (when I prepare the field definitions).
Does anyone create a new Solr instance for this kind of problem?
I found something called MultiCore. Is it a good solution for my problem?
Thanks
Solr does not stop you from hosting multiple entities in a single collection.
You can define the fields for both entities and host them within the same collection.
You would need an identifier field to tell the entities apart if you want to filter the results per entity.
If your collections are small, or there is a relationship between User and Car, it might be helpful to host them within the same collection.
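A minimal sketch of the single-collection approach, in plain Python. The field name entity_type and the prefixed id scheme are assumptions made for this example and would also have to be declared in schema.xml; q and fq are the standard Solr query and filter-query parameters.

```python
from typing import Optional
from urllib.parse import urlencode


def to_solr_doc(entity_type: str, row: dict) -> dict:
    """Tag a database row with its entity type so both tables share one index."""
    doc = dict(row)
    # Prefix the unique key so a car and a user with the same numeric id
    # do not overwrite each other in the index.
    doc["id"] = f"{entity_type}-{row['id']}"
    doc["entity_type"] = entity_type
    return doc


def select_params(text: str, entity_type: Optional[str] = None) -> str:
    """Build the query string for a /select request, optionally per entity."""
    params = [("q", text)]
    if entity_type is not None:
        # A filter query restricts results to one entity without affecting scoring.
        params.append(("fq", f"entity_type:{entity_type}"))
    return urlencode(params)
```

Calling select_params("audi", "car") restricts the search to cars, while select_params("audi") searches cars and users together.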
For Solr Multicore Check Answer
Solr Multicore is basically a setup that allows Solr to host multiple cores.
These cores can each host a completely different set of unrelated entities.
You can have a separate core for each table as well.
For example, if you have collections for Documents, People and Stocks, which are completely unrelated entities, you would want to host them in different collections.
Multicore setup would allow you to
Host unrelated entities separately so that they don't impact each other
Have a different configuration for each core, with different behavior
Perform activities on each core independently (Update data, Load, Reload, Replication)
Keep the size of each core in check and configure caching accordingly
It's more a matter of preference and requirements.
The main question for you is whether people will search for cars and users together. If not (they are different domains), you can set up multiple collections/cores. If they are going to be used together (e.g. a search for something that shows up in both cars and people), you may want to merge them into one index.
If you do use a single collection for both types, you may want to set up dedicated request handlers that return different sets of fields and possibly tune the searches. You can see an example of doing that (and a bit more) in the multilingual example from my book.
In our project we're trying to apply the Bounded Context approach and we've run into a fairly obvious performance problem. E.g., we have different classes (in different contexts) for representing a user in the system: Person in our core domain's context and User in the security context. So, we have two different repositories for each of the aggregate, but they are using the same table in DB and sometimes accessing the same data.
Is there a common solution for minimizing DB round trips in this case? Are there ORMs that deal with it, or should we code some caching system ourselves?
upd: the db is from legacy app, and we'll have to use it "as is"
So, we have two different repositories for each of the aggregate, but
they are using the same table in DB and sometimes accessing the same
data.
The fact that you have two aggregates stored in the same table is an indication of a problem with the design. In this case, it seems you have two bounded contexts - a BC for the core domain (Person is here) and an identity/access BC (User is here). The BCs are related and the latter can be seen as upstream from the former. A Person in the core domain has a corresponding User in the identity BC, but they are not exactly the same thing.
Beyond this relationship between the BCs there are questions regarding ownership of behavior. For example, both a Person and a User may have a name, and what has to be determined is who owns the behavior of changing a name. This can be implemented in several ways. Person may have its own name, with changes propagated to the identity BC. Alternatively, User may own changes to the name, in which case they must be propagated to Person via a synchronization mechanism.
Overall, your problem could be addressed in two ways. First, you can store Person and User aggregates in different tables. Any given query should only use one of these tables, and they can be synchronized in an eventually consistent manner. Another approach is to decouple the behavioral domain model from a model designed for queries (read-model). This way, you can create a read-model designed to serve one or more specific screens and have a customized query, perhaps even outside of an ORM.
If every User is also a Person (sometimes external services are modeled as special users too), the only data that User and Person should share in the database is their identifiers.
Indeed each entity in a domain model should hold references only to the data that they need to ensure their invariants.
Moreover, I guess that Users are identified by username and Persons are identified by something else (a VAT code or similar).
Thus, the simplest optimization technique is to avoid encapsulating in an entity any information that is not required to ensure its invariants.
Furthermore, you simply need an effective context mapping technique to get easily from User to Person when needed. I use shared identifiers for this.
As an example, you can expose the Person's identifier in the User class, so that a simple query to the Person's repository can provide you with the data you need.
Finally, I recommend the Vaughn Vernon series on Aggregate Root Design.
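The shared-identifier idea could be sketched like this in Python. The class layout, the in-memory repository and the sample values are all illustrative assumptions; in the real project the repository would read from the legacy table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class User:
    """Security context: what authentication needs, plus the shared identifier."""
    username: str
    person_id: int


@dataclass(frozen=True)
class Person:
    """Core domain context: holds only the data its own invariants need."""
    person_id: int
    full_name: str


class PersonRepository:
    def __init__(self, rows: Dict[int, Person]) -> None:
        self._rows = rows  # stand-in for the legacy table

    def get(self, person_id: int) -> Optional[Person]:
        return self._rows.get(person_id)


# Made-up sample data for the sketch.
people = PersonRepository({42: Person(42, "Ada Lovelace")})
user = User(username="ada", person_id=42)

# Crossing from the security context to the core domain is one keyed lookup.
person = people.get(user.person_id)
```

The User never loads Person data itself, so the security context touches the shared table only for the columns it owns.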
My latest project deals with a lot of "staging" data.
Like when a customer registers, the data is stored in "customer_temp" table, and when he is verified, the data is moved to "customer" table.
Before I start shooting e-mails, go on a rampage on how I think this is wrong and you should just put a flag on the row, there is always a chance that I'm the idiot.
Can anybody explain to me why this is desirable?
Creating 2 tables with the same structure, populating a table (table 1), then moving the whole row to a different table (table 2) when certain events occur.
I can understand it if table 2 stores archival, seldom-used data.
But I can't understand it if table 2 stores live data that can change constantly.
To recap:
Can anyone explain how wrong (or right) this seemingly counter-productive approach is?
If there is a significant difference between a "customer" and a "potential customer" in the business logic, separating them out in the database can make sense (you don't need to always remember to query by the flag, for example). In particular if the data stored for the two may diverge in the future.
It makes reporting somewhat easier and reduces the chances of treating both types of entities as the same one.
As you say, however, this does look redundant and would probably not be the way most people design the database.
There seem to be several possible explanations for why you would want "customer_temp".
As you noted, it could be for archival purposes, to allow analysis of the data. In that case, though, the historical data should be aggregated according to some query of interest, and using it for live data does not sound plausible.
As Oded noted, there could be business logic that differentiates between a customer and a potential customer.
Or it could be a security feature which requires logging all attempts to register a customer in addition to storing approved customers.
Any time I see a permanent table named "customer_temp" I see a red flag. This typically means that someone was working through a problem as they went along and didn't think ahead about it.
As for the structure you describe, there are some advantages. For example, the tables could be indexed differently or placed in different file locations for performance.
But typically these advantages aren't worth the cost of keeping the structures in sync when things change (adding a column to both tables, searching for two sets of dependencies, etc.).
If you really need them to be treated differently, then it's better to handle that by adding a layer of abstraction with a view rather than creating two separate models.
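A small sketch of the view approach, using Python's built-in sqlite3 module and an in-memory database. The base table name customer_all, the verified flag and the sample e-mail addresses are assumptions for the example; the view names mirror the two tables from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_all (
        id       INTEGER PRIMARY KEY,
        email    TEXT NOT NULL,
        verified INTEGER NOT NULL DEFAULT 0
    );
    -- The two "tables" from the question become views over one structure.
    CREATE VIEW customer      AS SELECT id, email FROM customer_all WHERE verified = 1;
    CREATE VIEW customer_temp AS SELECT id, email FROM customer_all WHERE verified = 0;
""")

conn.execute("INSERT INTO customer_all (email) VALUES (?)", ("a@example.com",))
conn.execute("INSERT INTO customer_all (email) VALUES (?)", ("b@example.com",))

# Verification flips a flag instead of moving the row between tables.
conn.execute("UPDATE customer_all SET verified = 1 WHERE email = ?", ("a@example.com",))

verified = conn.execute("SELECT email FROM customer").fetchall()
pending = conn.execute("SELECT email FROM customer_temp").fetchall()
```

Code that reads from customer never has to remember the flag, and a new column only has to be added in one place.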
I would have used a single table design, as you suggest. But I only know what you posted about the case. Before deciding that the designer was an idiot, I would want to know what other consequences, intended or unintended, may have followed from the two table design.
For example, it may reduce contention between processes that are storing new potential customers and processes accessing the existing customer base. Or it may permit certain columns to be constrained to be not null in the customer table that are permitted to be null in the potential customer table. Or it may permit write access to the customer table to be tightly controlled, and unavailable to operations that originate from the web.
Or the original designer may simply not have seen the benefits you and I see in a single table design.
I am developing a website that will manage data for multiple entities. No data is shared between entities, but they may be owned by the same customer. A customer may want to manage all their entities from a single "dashboard". So should I have one database for everything, or keep the data separated into individual databases?
Is there a best-practice? What are the positives/negatives for having a:
database for the entire site (entity has a "customerID", data has "entityID")
database for each customer (data has "entityID")
database for each entity (relation of database to customer is outside of database)
Multiple databases seem like they would give better performance (fewer rows and joins) but may eventually become a maintenance nightmare.
Personally, I prefer separate databases, specifically a database for each entity. I like this approach for the following reasons:
Smaller = faster regarding the queries.
Queries are simpler.
No risk of ever accidentally displaying one customer's data to another.
One database could become a performance bottleneck as it gets large (as the number of entities increases). You get a sort of built-in horizontal scalability with one database per entity.
Easy data clean up as customers or entities are removed.
Sure it'll take more time to upgrade the schema, but in my experience modifications are fairly uncommon once you deploy and additions are trivial.
I think this is hard to answer without more information.
I lean on the side of one database. Properly coded business objects should prevent you from forgetting clientId in your queries.
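One way to read "properly coded business objects" is a repository that is bound to a single client, so that no caller ever writes the clientId filter by hand. This is a sketch using sqlite3; the invoice table and the class name are hypothetical.

```python
import sqlite3
from typing import List


class InvoiceRepository:
    """Every query goes through a repository bound to one client_id."""

    def __init__(self, conn: sqlite3.Connection, client_id: int) -> None:
        self._conn = conn
        self._client_id = client_id

    def add(self, amount: float) -> None:
        self._conn.execute(
            "INSERT INTO invoice (client_id, amount) VALUES (?, ?)",
            (self._client_id, amount),
        )

    def all(self) -> List[float]:
        # The client filter lives here, once, not in each calling query.
        rows = self._conn.execute(
            "SELECT amount FROM invoice WHERE client_id = ? ORDER BY id",
            (self._client_id,),
        )
        return [amount for (amount,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice ("
    "id INTEGER PRIMARY KEY, client_id INTEGER NOT NULL, amount REAL NOT NULL)"
)
InvoiceRepository(conn, client_id=1).add(10.0)
InvoiceRepository(conn, client_id=2).add(99.0)
```

After this, InvoiceRepository(conn, client_id=1).all() only ever sees client 1's rows, because there is no method that queries without the filter.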
The type of database you are using and how it scales might help you make your decision.
For schema changes down the road, it seems one database would be easier from a maintenance perspective - you have one place to make them.
What about backup and restore? Could you experience a customer wanting to restore a backup for one of their entities?
This is a fairly normal scenario in multi-tenant SAAS applications. Both approaches have their pros and cons. Search on best practices for multi-tenant SAAS (software as a service) and you will find tons of stuff to ponder upon.
Check out this article on Microsoft's site. I think it does a nice job of laying out the different costs and benefits associated with multi-tenant designs. Also look at the Multitenancy article on Wikipedia. There are many trade-offs, and your best match greatly depends on what type of product you are developing.
One good argument for keeping them in separate databases is that it's easier to scale (you can simply have multiple installations of the server, with the client databases distributed across the servers).
Another argument is that once you are logged in, you don't need to add an extra where check (for client ID) in each of your queries.
So, a master DB backed by multiple DBs, one for each client, may be a better approach.
If the client would ever need to restore only a single entity from a backup and leave the others in their current state, then maintenance will be much easier if each entity is in a separate database. If they can be backed up and restored together, then it may be easier to maintain the entities as a single database.
I think you have to go with the most realistic scenario and not necessarily what a customer "may" want to do in the future. If you are going to market that feature (i.e. seeing all your entities in one dashboard), then you have to either find a solution (maybe have the dashboard pull from multiple databases) or use a single database for the whole app.
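If the dashboard has to pull from multiple databases, the aggregation could look roughly like this. The sketch uses one in-memory sqlite3 database per entity; the orders table, the entity names and the totals are invented for illustration.

```python
import sqlite3
from typing import Dict, Iterable


def make_entity_db(order_totals: Iterable[float]) -> sqlite3.Connection:
    """Create a stand-in for one entity's own database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (total REAL NOT NULL)")
    conn.executemany(
        "INSERT INTO orders (total) VALUES (?)", [(t,) for t in order_totals]
    )
    return conn


def dashboard(entity_dbs: Dict[str, sqlite3.Connection]) -> Dict[str, float]:
    """Run the same summary query against each entity database in turn."""
    summary = {}
    for entity_id, conn in entity_dbs.items():
        (total,) = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders"
        ).fetchone()
        summary[entity_id] = total
    return summary


# The customer-to-entity mapping lives outside the entity databases.
customer_entities = {
    "shop_a": make_entity_db([10.0, 5.0]),
    "shop_b": make_entity_db([]),
}
summary = dashboard(customer_entities)
```

The cost of this design is one query per entity for every dashboard view, which is the trade-off against a single shared database with a customer filter.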
IMHO, having the data for multiple clients in the same database just seems like a bad idea to me. You'll have to remember to always filter your queries by clientID.
It also depends on your RDBMS, e.g.:
With SQL Server, databases are cheap.
With Oracle, it is easy to partition tables by "customerID", so a single large database can run as fast as a small database for each customer.
Whichever you choose, try to hide it at a low level in your data access code.
Do you plan to have your code deployed to multiple environments?
If so, then try to keep it within one database and have all table references prefixed with a namespace from a configuration file.
The single database option would make the maintenance much easier.