Embedding data from another document in Cosmos DB - microservices

When embedding data from a document that lives in another collection, what is the best practice for deciding who should be responsible for populating that data in a microservice architecture?
As an example, let's say I have basic information about an organization:
{
  "id": 1,
  "legalName": "Initech"
}
which I want to embed in an invoice like this to avoid doing two service requests to show the invoice:
{
  "type": "Payable",
  "invoiceStatus": "Preparing Preliminary Version",
  "applicablePeriod": {
    "startDateTime": "2020-07-08T00:10:59.618Z",
    "endDateTime": "2020-07-08T00:10:59.618Z"
  },
  "issuedDateTime": "2020-07-08T00:10:59.618Z",
  "issuingOrganization": {
    "id": 1,
    "legalName": "Initech"
  }
}
Would it be the caller's responsibility to supply the data while creating/updating the invoice or would it be the invoice service that would retrieve the external data using the organization id and then embed the data as necessary?
I feel like I should avoid cross-service dependencies in the backend as much as possible. I understand that maintenance of the embedded data could be handled through the change feed, but I was wondering about the initial population of the embedded data.

Did you get an answer back on this? I at least wanted to provide an answer to serve as general guidance. It comes down to the state of the data. Please see the following document, which covers this specific topic in greater detail: Data modeling in Azure Cosmos DB
In general, use embedded data models when (link):
There are contained relationships between entities.
There are one-to-few relationships between entities.
There is embedded data that changes infrequently.
There is embedded data that will not grow without bound.
There is embedded data that is queried frequently together.
Embedding data works nicely for many cases but there are scenarios when denormalizing your data will cause more problems than it is worth. So what do we do now?
When to reference (link):
In general, use normalized data models when:
Representing one-to-many relationships.
Representing many-to-many relationships.
Related data changes frequently.
Referenced data could be unbounded.
Hybrid data models (link).
We've now looked at embedding (or denormalizing) and referencing (or normalizing) data; each has its upsides and each involves compromises, as we have seen.
It doesn't always have to be either/or; don't be scared to mix things up a little.
Based on your application's specific usage patterns and workloads there may be cases where mixing embedded and referenced data makes sense and could lead to simpler application logic with fewer server round trips while still maintaining a good level of performance.
So, with the above data-model information, the other half of the equation is Identifying microservice boundaries and designing a microservices architecture. But in a simpler scenario... the invoice service would perform the update to the root document, either by embedding the invoice or by linking to it.
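To make that concrete, here is a minimal sketch (Java, assuming the Azure Cosmos DB Java SDK v4) of an invoice service that resolves the organization by id and embeds the snapshot itself at creation time. The OrganizationClient interface, the method names and the field choices are illustrative, not anything prescribed by Cosmos DB or the question:

import com.azure.cosmos.CosmosContainer;
import java.time.OffsetDateTime;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public class InvoiceService {

    // Hypothetical abstraction over the organization service's API.
    public interface OrganizationClient {
        Map<String, Object> getOrganization(int id);
    }

    private final CosmosContainer invoiceContainer;       // Cosmos "invoices" container
    private final OrganizationClient organizationClient;  // cross-service client (or a local read model)

    public InvoiceService(CosmosContainer invoiceContainer, OrganizationClient organizationClient) {
        this.invoiceContainer = invoiceContainer;
        this.organizationClient = organizationClient;
    }

    // The caller supplies only the organization id; the invoice service resolves
    // and embeds the snapshot before writing the document.
    public Map<String, Object> createInvoice(String type, int issuingOrganizationId) {
        // Cross-service lookup by id (or a read of a copy kept fresh via the change feed).
        Map<String, Object> org = organizationClient.getOrganization(issuingOrganizationId);

        Map<String, Object> invoice = new LinkedHashMap<>();
        invoice.put("id", UUID.randomUUID().toString());
        invoice.put("type", type);
        invoice.put("invoiceStatus", "Preparing Preliminary Version");
        invoice.put("issuedDateTime", OffsetDateTime.now().toString());
        // Denormalized snapshot: only the id and the fields the invoice view needs.
        invoice.put("issuingOrganization", Map.of(
                "id", org.get("id"),
                "legalName", org.get("legalName")));

        invoiceContainer.createItem(invoice);
        return invoice;
    }
}

With this split the caller only ever passes the organization id, and keeping the embedded legalName current afterwards becomes the job of the change feed mentioned in the question.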

Related

Where in the stack to best merge analytical data-warehouse data with data scraped+cached from third-party APIs?

Background information
We sell an API to users, that analyzes and presents corporate financial-portfolio data derived from public records.
We have an "analytical data warehouse" that contains all the raw data used to calculate the financial portfolios. This data warehouse is fed by an ETL pipeline, and so isn't "owned" by our API server per se. (E.g. the API server only has read-only permissions to the analytical data warehouse; the schema migrations for the data in the data warehouse live alongside the ETL pipeline rather than alongside the API server; etc.)
We also have a small document store (actually a Redis instance with persistence configured) that is owned by the API layer. The API layer runs various jobs to write into this store, and then queries data back as needed. You can think of this store as a shared persistent cache of various bits of the API layer's in-memory state. The API layer stores things like API-key blacklists in here.
Problem statement
All our input data is denominated in USD, and our calculations occur in USD. However, we give our customers the query-time option to convert the response just-in-time to another currency. We do this by having the API layer run a background job to scrape exchange-rate data, and then cache it in the document store. Individual API-layer nodes then do (in-memory-cached-with-TTL) fetches from this exchange-rates key in the store, whenever a query result needs to be translated into a specific currency.
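For what it's worth, the in-memory-cached-with-TTL fetch looks roughly like the sketch below (Java with the Jedis client; the "exchange_rates" key name, the five-minute TTL and the class shape are simplified for illustration):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import java.time.Duration;
import java.time.Instant;

public class ExchangeRateCache {

    private final JedisPool pool;
    private final Duration localTtl = Duration.ofMinutes(5); // per-node in-memory TTL

    private volatile String cachedRatesJson;
    private volatile Instant fetchedAt = Instant.EPOCH;

    public ExchangeRateCache(JedisPool pool) {
        this.pool = pool;
    }

    public synchronized String currentRatesJson() {
        if (Instant.now().isAfter(fetchedAt.plus(localTtl))) {
            try (Jedis jedis = pool.getResource()) {
                // The background scraper job writes the latest rates under this key.
                cachedRatesJson = jedis.get("exchange_rates");
            }
            fetchedAt = Instant.now();
        }
        return cachedRatesJson;
    }
}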
At first, we thought that this unit conversion wasn't really "about" our data, just about the API's UX, and so we thought this was entirely an API-layer concern, where it made sense to store the exchange-rates data into our document store.
(Also, we noticed that, by not pre-converting our DB results into a specific currency on the DB side, the calculated results of a query for a particular portfolio became more cache-friendly; the way we're doing things, we can cache and reuse the portfolio query results between queries, even if the queries want the results in different currencies.)
But recently we've been expanding into also allowing partner clients to execute complex data-science/Business Intelligence queries directly against our analytical data warehouse. And it turns out that they will also, often, need to do final exchange-rate conversions in their BI queries as well—despite there being no API layer involved here.
It seems like, to serve the needs of BI querying, the exchange-rate data "should" actually live in the analytical data warehouse alongside the financial data; and the ETL pipeline "should" be responsible for doing the API scraping required to fetch and feed in the exchange-rate data.
But this feels wrong: the exchange-rate data has a different lifecycle and integrity constraints than our financial data. The exchange rates are dirty and ephemeral point-in-time samples attained by scraping, whereas the financial data is a reliable historical event stream. The exchange rates get constantly updated/overwritten, while the financial data is append-only. Etc.
What is the best practice for serving the needs of analytical queries that need to access backend "application state" for "query result presentation" needs like this? Or am I wrong in thinking of this exchange-rate data as "application state" in the first place?
What I find interesting about your scenario is when the exchange-rate data is applicable.
In the case of the API, it's all about the realtime value in the other currency and it makes sense to have the most recent value in your API app scope (Redis).
However, I assume your analytical data warehouse has tables with purchases that were made at a certain time. In those cases, the current exchange rate is not really relevant to the value of the transaction.
This might mean that you want to store the exchange rate history in your warehouse or expand the "purchases" table to store the values in all the currencies at that moment.
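A rough sketch of the point-in-time idea, assuming a rate history has been loaded into memory keyed by currency and effective date (the types and the floor-lookup rule are illustrative, not part of the question):

import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class RateHistory {

    // currency code -> (effective date -> USD rate), e.g. loaded from a warehouse table
    private final Map<String, NavigableMap<LocalDate, BigDecimal>> rates = new HashMap<>();

    public void addRate(String currency, LocalDate effectiveDate, BigDecimal usdRate) {
        rates.computeIfAbsent(currency, c -> new TreeMap<>()).put(effectiveDate, usdRate);
    }

    // Rate applicable to a purchase: the most recent sample on or before the purchase
    // date, not whatever the scraper happens to hold right now.
    public BigDecimal rateOn(String currency, LocalDate purchaseDate) {
        NavigableMap<LocalDate, BigDecimal> byDate = rates.get(currency);
        Map.Entry<LocalDate, BigDecimal> entry = byDate == null ? null : byDate.floorEntry(purchaseDate);
        if (entry == null) {
            throw new IllegalArgumentException("No rate on or before " + purchaseDate + " for " + currency);
        }
        return entry.getValue();
    }
}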

Multi-tenant database. One collection or one db per tenant?

For a multi-tenancy architecture for a web application using a document-oriented database I can see two conceivable options:
Having one database per tenant, with the collections logically separating different kinds of objects.
Having one collection per tenant, with all user data stored in one database and some kind of flag or object-type identifier on each record.
Have there been any studies or has any documentation been produced regarding these two options and the differences between them?
Is there a particular standard or good reason why someone designing a web application which allows multiple users to store vastly different kinds of data would choose one over the other?
Aside from speed/efficiency issues, are there any other things to be said about this that would influence the decision?
EDIT I'm aware some of the terminology might be database specific, so for all wondering I am specifically referring to MongoDB.
I wouldn't want tenant-specific collections. In my application, I usually hard-code collection names, in the same way as I'd hard-code table names if I were using SQL tables. There'd be one comments collection that stores all comments for a blog. I would not want to deal with collection names like comments_tenant_1 and comments_tenant_2, because 1) that feels error-prone, 2) it would make the application code more complicated (collection names would have to be replaced with functions that compute the collection name), and 3) the number of collections in a single database could grow huge, which would make a list of all collections look daunting; also, MongoDB isn't built for having very many collections (see the link in the comment below your question, which David B posted: https://docs.mongohq.com/use-cases/multi-tenant.html).
However, database names aren't coupled to application data structures, and you can grant permissions on databases (but not on single collections). So one database per tenant could be reasonable. As could be a per document tenant_id field in a single database for all tenants (see the above-mentioned link).
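As a rough illustration of the "single database, per-document tenant id" option using the MongoDB Java driver (the collection, field and class names here are just examples):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class CommentStore {

    private final MongoCollection<Document> comments;

    public CommentStore(String connectionString) {
        // One fixed, hard-coded collection shared by every tenant.
        this.comments = MongoClients.create(connectionString)
                .getDatabase("app")
                .getCollection("comments");
    }

    public void addComment(String tenantId, String blogPostId, String text) {
        comments.insertOne(new Document("tenantId", tenantId)
                .append("blogPostId", blogPostId)
                .append("text", text));
    }

    public Iterable<Document> commentsFor(String tenantId, String blogPostId) {
        // Every query is scoped by tenantId, so tenants never see each other's data.
        return comments.find(Filters.and(
                Filters.eq("tenantId", tenantId),
                Filters.eq("blogPostId", blogPostId)));
    }
}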

A Spring DAO that can adapt to changes in the data

For application developers, I suppose the traditional paradigm for writing an application with domain objects that can be persisted to an underlying data store (SQL database for arguments sake), is to write the domain objects and then write (or generate) the table structure. There is a tight coupling between what the domain object looks like and what the structure of underlying data store looks like. So if you want to add a piece of information to your domain object, you add the field to your code and then add a column to the appropriate database table. All familiar?
This is all well and good for data stores that have a well defined structure (I'm mainly talking about SQL databases whereby the tables and columns are pre-defined and fixed), but now a number of alternatives to the ubiquitous SQL database exist and these often do not constrain the data in this way. For instance, MongoDB is a NoSQL database whereby you divide data into collections but aside from that there is no structuring of the data. You don't define new columns when you want to add a new field.
Now to the question: given the flexibility of a data store like MongoDB, how would one go about achieving a similar kind of flexibility in the domain objects that represent this data? So, for instance, if I'm using Spring and creating my own domain objects, when I add a "middleName" field to my data, how can I avoid having to add a "middleName" field to my domain object? I'm looking for some kind of mechanism/approach/framework to dynamically inspect the data and have access to it in my domain object without having to make a code change every time. All ideas welcome.
I think you have a couple of choices:
You can use a dynamic programming language and not have domain objects (Clojure, for example)
If you're fixed on using Java, the Mongo Java driver returns data as a DBObject, which is essentially a Map. So the default behaviour already provides what you want. It's only when you map the DBObject into domain objects, using a library like Morphia (or Spring Data), that you have to worry about domain objects at all.
But if I were using Java, I would stick with the standard convention of domain objects mapped via Morphia, because I think adding a field is a very minor inconvenience when compared against the benefits.
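A small sketch of the contrast, using org.bson.Document (the modern driver's counterpart of DBObject, which literally implements Map<String, Object>); the class and field names are illustrative:

import org.bson.Document;

public class UserDao {

    // Schema-free access: whatever fields happen to be in the stored document
    // are reachable by name, with no code change when a new field appears.
    public Object fieldOf(Document user, String fieldName) {
        return user.get(fieldName); // works for "middleName" the day the data gains it
    }

    // Mapped access (Morphia / Spring Data style): adding "middleName" to the
    // data means adding the property here too, which is the minor cost mentioned above.
    public static class User {
        private String firstName;
        private String lastName;
        // private String middleName;  // the one-line change in question
    }
}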
I think the question is inherently paradoxical:
On one hand, you want to have domain objects, i.e. objects that represent the data (and behaviour) of your problem domain.
On the other hand, you say that you don't want your domain objects to be explicitly influenced by changes to the data.
But when you have objects that represent your problem domain, you want to do just that- to represent your problem domain.
So that if, for example, a middle name is added, then your representation of the real-life 'User' entity should change to accommodate this change to the real-life user; perhaps not only by adding this piece of data to your object, but also by adding some related behaviour (validation of the middle name, or some functionality related to it).
In essence, what I'm trying to say here is that when you have (classic OO) domain objects, you may need to change your behaviour/functionality along with your data, and since you don't have any automatic way of changing your behaviour, the question of automatically changing your data becomes irrelevant.
If you don't want behaviour associated with your data, then you essentially have DTOs, and @Kevin's answer is what you're looking for.
Honestly, it sounds more like you're looking for some kind of black-box DTO where, as you describe, fields are added or removed "arbitrarily" depending on the data. This makes me inclined to suggest a simple Map to do the job. You can't really have a domain-driven design if your domain model is constantly changing.

Should I create the model classes based on the structure of data in the database?

I have predefined tables in the database based on which I have to develop a web application.
Should I base my model classes on the structure of the data in the tables?
But a problem is that the tables are very poorly defined and there is much redundant data in them (which I cannot change!).
E.g. in two tables, three columns are the same.
Table: Student_details
Student_id, Name, Age, Class, School
Table: Student_address
Student_id, Name, Age, Street1, Street2, City
I think you should make your models in a way that is best suited to how they will be used. Don't worry about how the data is stored or where it is stored... otherwise, why go through the trouble of layering your code? Why not just do the direct DB query right in your view? So if you are going to create an abstraction of your data... a "model"... make one that is designed around how it will be used, not how it is (or will be) persisted.
This seems like a risky project - presumably, there's another application somewhere which populates these tables. As the data model is not very sound from a relational point of view, I'm guessing there's a bunch of business/data logic glued into that app - for instance, putting the student age into the StudentAddress table.
I'd support jsobo in recommending you build your business logic independently of the underlying persistence mechanism, and that you try to keep your models as domain-focused as possible, without too much emphasis on how the database happens to be structured.
You should, however, plan on spending a certain amount of time translating your domain models into their respective data representations and dealing with whatever quirks the data model imposes. I'd strongly recommend containing all this stuff in a separate translation layer - don't litter it throughout the rest of the application.
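A minimal sketch of what that translation layer might look like for the tables in the question; the row and domain types are assumed for illustration:

public class StudentTranslator {

    // Illustrative row/domain types; the real ones would mirror the actual tables.
    public record StudentDetailsRow(int studentId, String name, int age, String schoolClass, String school) {}
    public record StudentAddressRow(int studentId, String name, int age, String street1, String street2, String city) {}
    public record Address(String street1, String street2, String city) {}
    public record Student(int id, String name, int age, String schoolClass, String school, Address address) {}

    // The redundant Name/Age columns are reconciled in exactly one place, so the
    // quirks of the database never leak into the rest of the application.
    public Student toDomain(StudentDetailsRow details, StudentAddressRow address) {
        // Student_details is treated as the source of truth for Name and Age;
        // disagreements with Student_address could be logged or flagged here.
        return new Student(
                details.studentId(),
                details.name(),
                details.age(),
                details.schoolClass(),
                details.school(),
                new Address(address.street1(), address.street2(), address.city()));
    }
}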

UI-centric vs domain-centric data model - pros and cons

How closely does your data model map to your UI and domain model?
The data model can be quite close to the domain model if it has, for example, a Customer table, an Employee table etc.
The UI might not reflect the data model so closely, though - for example, there may be multiple forms, all feeding in bits and pieces of Customer data along with other miscellaneous bits of data. In this case, one could have separate tables to hold the data from each form. As required, the data can then be combined at a future point... Alternatively, one could insert the form data directly into a Customer table, so that the data model does not correlate well to the UI.
What has proven to work better for you?
I find it cleaner to map your domain model to the real world problem you are trying to solve.
You can then create viewmodels which act as a bucket of all the data required by your view.
As stated, your UI can change frequently, but this does not usually change the particular domain problem you are tackling...
More information on this pattern can be found here:
http://blogs.msdn.com/dphill/archive/2009/01/31/the-viewmodel-pattern.aspx
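A tiny sketch of that separation, with assumed class names: the view model is just a bucket of the fields one screen needs, assembled from whichever domain objects happen to hold them.

public class CustomerSummaryViewModel {

    // Illustrative domain types; the real domain model stays focused on the problem domain.
    public record Customer(String fullName, String segment) {}
    public record OrderHistory(java.time.LocalDate lastOrderDate, int openInvoiceCount) {}

    public final String displayName;
    public final String lastOrderDate;
    public final int openInvoices;

    // Pull bits and pieces from one or more domain objects;
    // the screen never touches the domain model directly.
    public CustomerSummaryViewModel(Customer customer, OrderHistory orders) {
        this.displayName = customer.fullName();
        this.lastOrderDate = orders.lastOrderDate().toString();
        this.openInvoices = orders.openInvoiceCount();
    }
}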
UI can change according to many needs, so it's generally better to keep data in a domain model, abstracted away from any one UI.
If I have a RESTful service layer, then what it is exposing is the domain model. In that case, the UI (any particular screen) calls a number of these services and composes the screen from the domain models it collects. In this scenario, although domain models bubble all the way up to the UI, the UI layer skims out the necessary data to build its particular screen. There are also some interesting questions on SO about using the (annotated) domain model for persistence.
My point here is that the domain model can be a single source of truth. It can do the work of carrying data and encapsulating logic fairly well. I have worked on projects which had a lot of boilerplate code translating each domain model to DTOs, VOs, DOs and what-have-yous. A lot of that looked quite unnecessary and more due to habit in most cases.
