DM and hierarchies - dimensions for future use - dimensional-modeling

My very first DM so be gentle..
Modeling a hierarchy with ERD as follows:
Responses are my facts. All the advice I've seen indicates creating a single dimension (say dim_event) and denormalizing event, department and organization into that dimension:
What if I KNOW that there will be future facts/reports that rely on an Organization dimension, or a Department dimension that do not involve this particular fact?
It makes more sense to me (from the OLTP world) to create individual dimensions for the major components and attach them to the fact. That way they could be reused as conformed dimensions.
This way for any updating dimension attributes there would be one dim table; if I had everything denormalized I could have org name in several dimension tables.
--Update--
As requested:
An "event" is an email campaign designed to gather response data from a specific subset of clients. They log in and we ask them a series of questions and score the answers.
The "response" is the set of scores we generate from the event.
So an "event" record may look like this:
name: '2019 test Event'
department: 'finance'
"response" records look something like this:
event: '2019 test Event'
retScore: 2190
balScore: 19.98

If your organization and department are tightly coupled (i.e. department implies organization as well), they should be denormalized and created as a single dimension. If department & organization do not have a hierarchical relationship, they would be separate dimensions.
Your Event would likely be a dim (degenerate) and a fact. The fact would point to the various dimensions that describe the Event and would contain the measures about what happened at the Event (retScore, balScore).
A good way to identify if you're dealing with a dim or a fact is to ask "What do I know before anything happens?" I expect you'd know which orgs & depts are available. You may even know certain types of recurring events (blood drive, annual fundraiser), which could also be a separate dimension (event type). But you wouldn't have any details about a specific event, HR Fundraiser 2019 (fact), until one is scheduled.
A dimension represents the possibilities, but a fact record indicates something actually happens. My favorite analogy for this is a restaurant menu vs a restaurant order. The items on the menu can be referenced even if they've never been ordered. The menu is the dimension, the order is the fact.
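To make the menu-versus-order distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_org_department, fact_response, and so on) and the organization name "Acme" are invented for illustration and are not taken from the question's ERD; the sketch also assumes that department implies organization, so the two are folded into one dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension: known before anything happens (the "menu").
CREATE TABLE dim_org_department (
    org_dept_key INTEGER PRIMARY KEY,
    organization TEXT NOT NULL,
    department   TEXT NOT NULL,
    UNIQUE (organization, department)
);
-- Fact: one row per scored response (the "order").
CREATE TABLE fact_response (
    response_key INTEGER PRIMARY KEY,
    org_dept_key INTEGER NOT NULL REFERENCES dim_org_department(org_dept_key),
    event_name   TEXT NOT NULL,   -- degenerate dimension kept on the fact row
    ret_score    REAL,
    bal_score    REAL
);
""")

# Hypothetical dimension rows: both departments exist before any response does.
conn.executemany(
    "INSERT INTO dim_org_department (organization, department) VALUES (?, ?)",
    [("Acme", "finance"), ("Acme", "hr")],
)

# One fact row, using the sample values from the question.
conn.execute(
    "INSERT INTO fact_response (org_dept_key, event_name, ret_score, bal_score) "
    "SELECT org_dept_key, '2019 test Event', 2190, 19.98 "
    "FROM dim_org_department WHERE department = 'finance'"
)

# Dimension rows that no fact references yet are still queryable.
unused = conn.execute("""
    SELECT d.department
    FROM dim_org_department d
    LEFT JOIN fact_response f ON f.org_dept_key = d.org_dept_key
    WHERE f.response_key IS NULL
""").fetchall()
```

The "hr" row plays the role of a menu item nobody has ordered yet: it can be reported on, and reused by other facts, without any response pointing at it.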
Hope this helps.

Related

Product price changed while creating order

What is the DDD way of handling the following scenario:
user enters Order Create screen and starts creating new Order with OrderItems
user chooses ProductX from products catalog and adds quantity
OrderItem for ProductX is created on Order and user goes on adding another product
in the meantime, before Order is saved, admin changes price for ProductX
Assuming Product and Order/OrderItem are separate aggregates, potentially even separate bounded contexts, how is this handled?
I can think of several options:
optimistic concurrency combined with db transactions, but then if we broaden the question to microservices where each microservice has its own db - what then?
joining everything into one giant AR but that doesn’t seem right.
introduce a business rule that no product prices are updated during the point of sales working hours but that is often not possible (time triggered discounts, e.g.)
What is the proper DDD/microservices way of solving this?
What is the proper DDD/microservices way of solving this?
The general answer is that you make time an explicit part of your pricing model. Price changes made to the product catalog have an effective date, which means that you can, by modeling time in the order, have complete agreement on what price the shopper saw at the time of the order.
This might introduce the concept of a QuotedPrice as something separate from the Catalog price, where the quote is a promise to hold a price for some amount of time.
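As a rough sketch of what "time as an explicit part of the pricing model" could look like, the Python below keeps effective-dated prices per product and issues a quote that holds for a fixed window. PriceHistory, its methods, and the 30-minute validity window are assumptions made for this example; only the QuotedPrice idea comes from the answer.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class QuotedPrice:
    """A promise to hold a price for some amount of time."""
    product_id: str
    amount: float
    quoted_at: datetime
    valid_for: timedelta

    def is_valid(self, at: datetime) -> bool:
        return self.quoted_at <= at <= self.quoted_at + self.valid_for


class PriceHistory:
    """Effective-dated catalog prices for one product, sorted by effective date."""

    def __init__(self, product_id: str):
        self.product_id = product_id
        self._effective = []  # sorted list of datetimes
        self._amounts = []    # amount that takes effect at the matching datetime

    def set_price(self, effective_from: datetime, amount: float) -> None:
        i = bisect_right(self._effective, effective_from)
        self._effective.insert(i, effective_from)
        self._amounts.insert(i, amount)

    def price_at(self, at: datetime) -> float:
        i = bisect_right(self._effective, at)
        if i == 0:
            raise LookupError("no price effective at that time")
        return self._amounts[i - 1]

    def quote(self, at: datetime, valid_for: timedelta = timedelta(minutes=30)) -> QuotedPrice:
        return QuotedPrice(self.product_id, self.price_at(at), at, valid_for)


history = PriceHistory("ProductX")
history.set_price(datetime(2019, 1, 1, 9, 0), 10.00)
quote = history.quote(datetime(2019, 1, 1, 10, 0))      # shopper adds the item
history.set_price(datetime(2019, 1, 1, 10, 5), 12.00)   # admin changes the price
saved_at = datetime(2019, 1, 1, 10, 15)                 # order is saved
```

When the order is saved, the order side only has to check quote.is_valid(saved_at); the catalog can change freely in the meantime, because the order carries the price the shopper actually saw.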
To address this sort of problem in general, here are three important papers to review:
Memories, Guesses, and Apologies -- Pat Helland, 2007
Data on the Outside vs Data on the Inside -- Pat Helland, 2005
Race Conditions Don't Exist -- Udi Dahan, 2010
I think one way to solve this is through events. As you said, Product and Order are at the very least separate aggregates, so I would keep them loosely coupled. Putting them into one single aggregate root would go against the Open/Closed and Single Responsibility Principles.
If a Product changes it can raise a ProductChanged event and likewise of an Order.
Depending on whether these Domain-Objects are within the same service or different service you can create a Domain-Event or an Integration event. Read more about it here.
From the above link:
A domain event is, something that happened in the domain that you want other parts of the same domain (in-process) to be aware of. The notified parts usually react somehow to the events.
I think this fits perfectly to your scenario.
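A minimal in-process sketch of that idea, in Python. The EventBus and DraftOrders classes, and the choice to flag affected draft orders for review, are assumptions for illustration; the answer only names the ProductChanged event.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductChanged:
    product_id: str
    new_price: float


class EventBus:
    """Tiny in-process dispatcher: handlers are registered per event type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        for handler in self._handlers[type(event)]:
            handler(event)


class DraftOrders:
    """Hypothetical order-side component that reacts to product changes."""

    def __init__(self):
        # order id -> product ids referenced by that unsaved order
        self.items_by_order = {"order-1": {"ProductX"}, "order-2": {"ProductY"}}
        self.needs_review = set()

    def on_product_changed(self, event: ProductChanged) -> None:
        for order_id, product_ids in self.items_by_order.items():
            if event.product_id in product_ids:
                self.needs_review.add(order_id)


bus = EventBus()
orders = DraftOrders()
bus.subscribe(ProductChanged, orders.on_product_changed)
bus.publish(ProductChanged("ProductX", 12.00))
```

The Product side never touches an Order directly; it only announces what happened, and the order side decides how to react. Across services, the same shape would apply with an integration event carried by a message broker instead of an in-process call.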

Event Sourcing - Aggregate modeling

Two questions
1) How to model aggregate and reference between them
2) How to organise/store events so that they can be retrieved efficiently
Take this typical use case as example, we have Order and LineItem (they are an aggregate, Order is the aggregate root), and Product aggregate.
As LineItem needs to know which Product it refers to, there are two options: 1) LineItem holds a direct reference to the Product aggregate (which does not seem to be best practice, as it violates the idea of an aggregate being a consistency boundary, because we could update the Product aggregate directly from the Order aggregate), or 2) LineItem only holds a ProductId.
It looks like 2nd option is the way to go...What do you think here?
However, another problem arises, which is about building an Order read/view model. This Order view model needs to know which Products are in the Order (i.e. ProductId, Type, etc.). The typical use case is reporting, and CommandHandler also can use this Product object to perform logic such as whether there are too many particular products, etc. To do that, given that the data lives in two separate aggregates, we need more than one database round trip. As we are using events to build the model, the pseudocode looks like this:
1) for a given order id (guid, order aggregate id), we load all the events for it; -- 1st database access
2) then build an Order aggregate, so we know which ProductIds are referenced in the Order;
3) for the list of ProductIds, we load all events for it; -- 2nd database access
If we build a really big graph of objects (a lot of different aggregates), then this may end up with a few more database access (each of which is slow)...What's your idea in here?
Thanks
Take this typical use case as example, we have Order and LineItem (they are an aggregate, Order is the aggregate root), and Product aggregate.
The Order aggregate makes sense the way you have described it. "Product aggregate" is more suspicious; do you ask the model if the product is allowed to change, or are you telling the model that the product has changed?
If Product can change without first consulting with the order, then the LineItem must not include the product. A reference to the product (aka the ProductId) is ok.
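A small Python sketch of the "reference by identity" option. The class and field names are illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(frozen=True)
class LineItem:
    product_id: UUID  # identity only; no reference to the Product aggregate
    quantity: int


@dataclass
class Order:
    """Aggregate root: the only entry point for changing line items."""
    order_id: UUID
    items: list = field(default_factory=list)

    def add_item(self, product_id: UUID, quantity: int) -> None:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.items.append(LineItem(product_id, quantity))


product_id = uuid4()
order = Order(uuid4())
order.add_item(product_id, 2)
```

Because LineItem holds nothing but the identifier, nothing reachable from Order can mutate a Product, so each aggregate stays its own consistency boundary.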
If we build a really big graph of objects (a lot of different aggregates), then this may end up with a few more database access (each of which is slow)...What's your idea in here?
For reads, reports, and the like -- where you aren't going to be adding new events to the history -- one possible answer is to do the slow work in advance. An asynchronous process listens for writes in the event store, and then publishes those events to a bus. Subscribers build new versions of the reports when new events are observed, and cache the results. (search keyword: cqrs)
When a client asks for a report, you give them one out of the cache. All the work is done, so it's very quick.
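The following Python sketch shows the shape of such a subscriber. The event types (ItemAdded, ProductRenamed) and the OrderReportProjection class are invented for this example, and the bus is reduced to a plain loop over events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemAdded:
    order_id: str
    product_id: str
    quantity: int


@dataclass(frozen=True)
class ProductRenamed:
    product_id: str
    name: str


class OrderReportProjection:
    """Builds a cached, denormalized order view as events arrive."""

    def __init__(self):
        self.product_names = {}  # product_id -> latest known name
        self.reports = {}        # order_id -> {product_id: total quantity}

    def handle(self, event) -> None:
        if isinstance(event, ProductRenamed):
            self.product_names[event.product_id] = event.name
        elif isinstance(event, ItemAdded):
            lines = self.reports.setdefault(event.order_id, {})
            lines[event.product_id] = lines.get(event.product_id, 0) + event.quantity

    def report(self, order_id):
        # Read path: no event replay and no second lookup, only the cache.
        lines = self.reports.get(order_id, {})
        return [(self.product_names.get(pid, pid), qty) for pid, qty in lines.items()]


projection = OrderReportProjection()
for event in [
    ProductRenamed("p1", "Widget"),
    ItemAdded("o1", "p1", 2),
    ItemAdded("o1", "p1", 1),
]:
    projection.handle(event)
```

The two round trips from the question (order events, then product events) happen once, ahead of time, as events are published; projection.report("o1") is then a dictionary lookup.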
For command handlers, the answer is more complicated. Business rules are supposed to be in the domain model, so having the command handler try to validate the command (as opposed to the domain model) is a bit broken.
The command handler can load the products to see what the state might look like, and pass that information to the aggregate with the command data, but it's not clear that's a good idea -- if the client is going to send a command to be run, and you need to flesh out the Order command with Product data, why not instead have the command add the product data to the command directly, and skip the middle man.
CommandHandler also can use this Product object to perform logic such as whether there are too many particular products, etc.
This example is a bit vague, but taking a guess: you are thinking about cases where you prevent an order from being placed if the available inventory is insufficient to fulfill the order.
For real world inventory - a physical book in a warehouse - that's probably the wrong approach to take. First, the model itself is wrong; if you want to know how much product is in the warehouse, you should be querying the warehouse, not the product. Second, a physical warehouse is not constrained by your model -- calling the addProduct method on a warehouse aggregate doesn't cause the product to magically appear there.
Third, it probably doesn't match very well with what your domain experts want anyway. If the model says that the warehouse doesn't have enough product, do you think the stakeholders want the system to
tell the shopper to buy the product somewhere else, or...
accept the order, and contact the supplier for a new delivery.
Hint: when in doubt, carefully review how amazon.com does it.

Database schema for rewarding users for their activities

I would like to provide users with points when they do a certain thing. For example:
adding article
adding question
answering question
liking article
etc.
Some of them can have conditions like there are only points for first 3 articles a day, but I think I will handle this directly in my code base.
The problem is what would be a good database design to handle this? I think of 3 tables.
user_activities - in this table I will store event types (I use laravel so it would probably be the event class name) and points for specific event.
activity_user - pivot table between user_activities and users.
and of course users table
It is very simple so I am worrying that there are some conditions I haven't thought of, and it would come and bite me in the future.
I think you'll need a fourth table, simply called "activities", that lists the kinds of activities to track. It will have an ID column, and your user_activities table then includes an 'activity_id' to link to it. You'll no doubt have unique information for each kind; for example, an activities table may have columns like
ID : unique ID per laravel
ACTIVITY_CODE : short code to use as part of application/business logic
ACTIVITY_NAME : longer name that is for display name like "answered a question"
EVENT : what does the user have to do to trigger the activity award
POINT_VALUE: how many points for this event
etc
If you think that points may change in the future (eg. to encourage certain user activities) then you'll want to track the actual point awarded at the time in the user activities table, or some way to track what the points were at any one time.
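One possible way to track the points actually awarded is to copy the current value into the per-user row at award time. The sketch below uses Python's sqlite3 module rather than Laravel migrations; the points_awarded column, the award helper, and the 'ADD_ARTICLE' code are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (
    id            INTEGER PRIMARY KEY,
    activity_code TEXT NOT NULL UNIQUE,
    point_value   INTEGER NOT NULL       -- current value, may change over time
);
CREATE TABLE user_activities (
    id             INTEGER PRIMARY KEY,
    user_id        INTEGER NOT NULL,
    activity_id    INTEGER NOT NULL REFERENCES activities(id),
    points_awarded INTEGER NOT NULL      -- snapshot of the value at award time
);
""")


def award(conn, user_id, activity_code):
    """Record an activity for a user, copying the point value in force right now."""
    conn.execute(
        "INSERT INTO user_activities (user_id, activity_id, points_awarded) "
        "SELECT ?, id, point_value FROM activities WHERE activity_code = ?",
        (user_id, activity_code),
    )


conn.execute("INSERT INTO activities (activity_code, point_value) VALUES ('ADD_ARTICLE', 10)")
award(conn, 1, "ADD_ARTICLE")
conn.execute("UPDATE activities SET point_value = 25 WHERE activity_code = 'ADD_ARTICLE'")
award(conn, 1, "ADD_ARTICLE")

total = conn.execute(
    "SELECT SUM(points_awarded) FROM user_activities WHERE user_id = 1"
).fetchone()[0]
```

The first award keeps its 10 points after the value is raised to 25, so the user's total is computed from what was actually granted rather than from today's rates.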
While I'm suggesting a fourth table, what you really need is a more carefully worded list of features to be implemented before doing any design work. My example of allowing the points awarded to change over time is such a feature: you don't mention it, but you'll need to design for it if it is needed.
Well I have found this https://laracasts.com/lessons/build-an-activity-feed-in-laravel as very good solution. Hope it helps someone :)

SQLite database design for music chart tracker

I've been putting together a little SQLite database to track the top 100 songs from the iTunes RSS feed. I've built the script in Bash to do all the hard work and it's working finally, but I'm not sure if my database structure is correct, so I'm looking for some feedback on the best way to go as I am only learning SQL as I go at the moment so I don't want to dig myself into a hole when it comes to building the queries to retrieve data in time!
I have 3 tables like so;
artists_table
artist_id - PK
artist_name
songs_table
song_id - PK
artist_id - FK (from the artists table)
charts_table
chart_id - PK
song_id - FK (from the songs table)
position - (chart position 1-100)
date - (date of chart position xxxx-xx-xx)
The artists and songs table seem good to me, got foreign key constraint working...etc but I'm not sure about the charts table, anything obviously wrong with this structure?
I want to track songs/artists/positions over time so I can generate some stats...etc
Thanks,
Initial Response
I ask you about the data, in order to answer your Question, but you keep telling me about the process. No doubt, that is very important to you. And now you wish to ensure that the Record Filing System is correct.
Personally, I never write a line of code until I have the database designed. Partly because I hate to rewrite code (and I love to code). You have the sequence reversed, an unfortunate trend these days. Which means, whatever I give you, you will have to rewrite large chunks of your code.
(b.1) How exactly does it check if the artist[song] already exists ?
(b.2) How do you know that there is NOT more than one occurrence of a specific artist/song on file ?
Right now, given the details in your Question, let's say that you have incoming, that Pussycat Dolls place 66 on the MTV chart today:
INSERT INTO artist (artist_name) VALUES ('Pussycat Dolls');  -- succeeds, intended
INSERT INTO artist (artist_name) VALUES ('Pussycat Dolls');  -- succeeds, unintended
INSERT INTO artist (artist_name) VALUES ('Pussycat Dolls');  -- succeeds, unintended
Exactly which Pussycat Dolls record placed 66th today ?
When your RFS grows, and you have more fields in artist, eg. birth_date, which of the three records would you like to update ?
Ditto for Song.
How is Chart identified, is it something like US Top 40 ?
(b.1) How exactly does it check if the artist[song] already exists ?
When you execute code, it runs inside the sqLite program. What is the exact SQL string that you pass it ? Let's say you do this:
-- pseudocode: look the artist up by name, insert only if not found
SELECT $artist_id = artist_id
FROM artist
WHERE artist_name = $artist_name
IF $artist_id IS NULL
INSERT INTO artist (artist_name) VALUES ( $artist_name )
Then you are going to have a few surprises when the system goes "live". Hopefully this interaction will eliminate them. Right now you have a few hundred artists.
when you have a few thousand artists, the system will slow down to a snail's pace.
when things go awry, you will have duplicate artists, songs, charts.
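Both surprises come from the name not being declared as a key. A minimal sketch of one way to address that, using Python's sqlite3 module: the UNIQUE constraint and the get_or_create_artist helper are assumptions for illustration, not the schema the answer goes on to propose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE artist (
        artist_id   INTEGER PRIMARY KEY,
        artist_name TEXT NOT NULL UNIQUE  -- the UNIQUE index also serves lookups by name
    )
""")


def get_or_create_artist(conn, name):
    """Return the artist's id, inserting the row only if the name is new."""
    # INSERT OR IGNORE is a no-op when the name already exists.
    conn.execute("INSERT OR IGNORE INTO artist (artist_name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT artist_id FROM artist WHERE artist_name = ?", (name,)
    ).fetchone()
    return row[0]


# Three attempts to file the same artist yield one row and one id.
ids = {get_or_create_artist(conn, "Pussycat Dolls") for _ in range(3)}
count = conn.execute("SELECT COUNT(*) FROM artist").fetchone()[0]
```

With the constraint in place the database itself rejects the second and third "Pussycat Dolls", and the index behind it means the lookup by name does not have to scan the whole table.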
Record Filing System
Right now, you have a pre-1970's ISAM Record Filing System, with no Relational integrity, power, or speed.
If you would like to understand more about the dangers of an RFS, in today's Relational context, please read this Answer.
Relational Database
As I understand it, you want the integrity, power, and speed of a Relational Database. Here is what you are heading towards. Obviously, it is incomplete and unconfirmed; there are many details missing, and many questions remain open. But we have to model the data, only as data (as opposed to what you are going to do with it, the process), and nothing but the data.
This approach will ensure many things:
as the data grows and is added to (in terms of structure, not population), the existing data and code will not change
you will have data and referential integrity
you can obtain each of your stats via a single SELECT command.
you can execute any SELECT against the data, even SELECTs that you are not capable of dreaming about, meaning unlimited stats. As long as the data is stored in Relational form.
A database is a collection of facts about the real world, limited to the subject area of concern. Thus far we don't have facts, we have a recording of an incoming RSS stream. And the recording has no integrity, there is nothing that your code can rely on. This is heading in the direction of facts:
First Draft Music Chart TRD (Obsolete due to progression, see below.)
Response to Comments 1
Currently, I am only tracking one chart, but I see in your model that it also has the ability to track several, that is nice!
Not really. It is a side-effect of Doing Things Properly. The issue here is one of Identification. A Chart Position is not identified by RSS Feed ID, or chart_table.id, plus a PositionNo plus a DateTime. No. A Chart Position is identified as US Top 100/27 Apr 15/1… The side effect is that ChartName is part of the Identifier, and that allows multiple Charts, with no additional coding.
In these dark days of IT, people often write systems for one Country, and implement a StateCode all over the place. And then experience massive problems when they open up to an international customer base. The point is, there is no such thing as a State that does not have a Country, a State exists only in the context of a Country. So the Identifier for State must include a Country Identifier, it is (CountryCode, StateCode). Both Australia and Canada have NT for a StateCode.
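The State example can be sketched with Python's sqlite3 module. The table and column names are illustrative assumptions; the point is only that the key of state is the pair (country_code, state_code).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE country (
    country_code TEXT PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE state (
    country_code TEXT NOT NULL REFERENCES country(country_code),
    state_code   TEXT NOT NULL,
    name         TEXT NOT NULL,
    PRIMARY KEY (country_code, state_code)  -- a State exists only within a Country
);
""")

conn.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("AU", "Australia"), ("CA", "Canada")],
)
conn.executemany(
    "INSERT INTO state VALUES (?, ?, ?)",
    [
        ("AU", "NT", "Northern Territory"),
        ("CA", "NT", "Northwest Territories"),
    ],
)

nt_rows = conn.execute(
    "SELECT country_code, name FROM state WHERE state_code = 'NT' ORDER BY country_code"
).fetchall()
```

With state_code alone as the key, the second 'NT' row could not be stored; with the composite key, both rows coexist and going international requires no change to existing keys.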
If I can explain how I store the data from the rss feed, it might clear things up somewhat.
No, please. This is about the data, and only the data. Please review my previous comments on that issue, and the benefits.
I am away from my main computer at the moment, but I will respond within the next couple of hours if thats ok.
No worries. I will get to it tomorrow.
Your model does make sense to me though,
That is because you know the data values intimately, but you do not understand the data, and when someone lays it out for you, correctly, you experience pleasurable little twitches of recognition.
I don't mind having to recode everything, its a learning curve!
That's because you put the cart before the horse, and coded against data laid out in a spreadsheet, instead of designing the database first and coding against that second.
If you are not used to the Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
Response to Comments 2
Just one more quick question.
Fire away, until you are completely satisfied.
In the diagram, would there be any disadvantage to putting the artist table above the song table and making the song table a child of the parent artist instead? As artists can have many songs, but each song can only have 1 artist. Is there any need for the additional table to contain just the artistPK and songPK. Could I not store the artistPK into the songs table as a FK, as a song can only exist if there is an associated artist?
Notice your attachment to the way you had it organised. I repeat:
A database is a collection of facts about the real world, limited to the subject area of concern.
Facts are logical, not physical. When those facts are organised correctly (Normalised, designed):
You can execute any SELECT against the data, even SELECTs that you are not capable of dreaming about, meaning unlimited stats. As long as the data is stored in Relational form.
When they aren't, you can't. All SQL (not only reports that are envisioned) against the data is limited to the limitations in the model, which boils down to one thing: discrete facts being recorded in logical form, or not.
With the TRD we have progressed to recording facts about the real world, limited only by the scope of the app, and not by the non-discretion of facts.
Could I not store the artistPK into the songs table as a FK, as a song can only exist if there is an associated artist?
In your working context, at this moment, that is true. But that is not true in the real world that you are recording. If the app or your scope changes, you will have to change great slabs of the db and the app. If you record the facts correctly, as they exist, not as limited to your current app scope, no such change will be necessary when the app or your scope changes (sure, you will have to add objects and code, but not modify existing objects and code).
In the real world, Song and Artist are discrete facts, each can exist independent of the other. Your proposition is false.
Ave Maria existed for 16 centuries before Karen Carpenter recorded it.
And you already understand and accept that an Artist exists without a Song.
Is there any need for the additional table to contain just the artistPK and songPK.
It isn't an "additional table to contain just the artistPK and songPK", it is recording a discrete fact (separate to the independent existence of Artist and Song), that a specific Artist recorded a specific Song. That is the fact that you will count on in the ChartDatePosition.
Your proposition places Song as dependent on, subordinate to, Artist, and that is simply not true. Any and all stats (dreamed of or not) that are based on Song will have to navigate Artist::ArtistSong, then sort or ORDER BY, etc.
artists can have many songs, but each song can only have 1 artist.
That is half-true (true in your current working context, but not true in the real world). The truth is:
Each Artist is independent
Each Song is independent
Each Artist recorded 1-to-n Songs (via ArtistSong)
Each Song was recorded by 1-to-n Artists (via ArtistSong)
For understanding, changing your words above to form correct propositions (as opposed to stating technically correct Predicates):
Artists can have many RecordedSongs
Each RecordedSong can only have 1 Artist
Each RecordedSong can only have 1 Song
So yes, there are disadvantages, significant ones.
Which is why I state, you must divorce yourself from the app, the usage, and model the data, as data, and nothing but data.
Solution 2
I have updated the TRD.
Second Draft Music Chart TRD
Courier means example data; blue indicates a Key (Primary is always first); pipe indicates column separation; slash indicates Alternate Key (only the columns that are not in the PK are shown); green indicates non-key.
I am now giving you the Predicates. These are very important, for many reasons. The main reason here is that they disambiguate the issues we are discussing.
If you would like more information on Predicates, visit this Answer, scroll down (way down!) to Predicate, and read that section. Also evaluate that TRD and those Predicates against it.
The index on ChartDateSong needs explanation. At first I assumed:
PK ( Chart, Date, Rank )
But then for Integrity purposes, as well as search, we need:
AK ( Chart, Date, ArtistId, SongId )
Which is a much better PK. So I switched them. We do need both. (I don't know about NONsqLite, if it has clustered indices, the AK, not the PK should be clustered.)
PK ( Chart, Date, ArtistId, SongId )
AK ( Chart, Date, Rank )
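A hedged sketch of that pair of keys in SQLite, via Python's sqlite3 module. The snake_case names, the chart_rank column name, the 1-100 check, and the sample rows (including the "UK Top 40" chart) are assumptions for illustration; the referenced ArtistSong table is omitted to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE chart_date_song (
        chart      TEXT    NOT NULL,
        chart_date TEXT    NOT NULL,
        artist_id  INTEGER NOT NULL,
        song_id    INTEGER NOT NULL,
        chart_rank INTEGER NOT NULL CHECK (chart_rank BETWEEN 1 AND 100),
        PRIMARY KEY (chart, chart_date, artist_id, song_id),  -- PK
        UNIQUE (chart, chart_date, chart_rank)                -- AK
    )
""")
conn.execute("INSERT INTO chart_date_song VALUES ('US Top 100', '2015-04-27', 1, 1, 1)")


def try_insert(row):
    """Return True if the row is accepted, False if a key rejects it."""
    try:
        conn.execute("INSERT INTO chart_date_song VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


same_song_twice = try_insert(("US Top 100", "2015-04-27", 1, 1, 2))  # violates the PK
same_rank_twice = try_insert(("US Top 100", "2015-04-27", 2, 2, 1))  # violates the AK
other_chart = try_insert(("UK Top 40", "2015-04-27", 1, 1, 1))       # different chart, allowed
```

The PK stops one recording from appearing twice on the same chart and date; the AK stops two recordings from sharing a rank. Because the chart name is part of both keys, a second chart needs no schema change.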
Response to Comments 3
What about the scenario when a song enters the charts with the same song_name as a record in the song_table but is completely unrelated (not a cover, completely original, but just happens to share the same name)
In civilised countries that is called fraud, obtaining benefit by deception, but I will try to think in devilish terms for a moment and answer the question.
Well, if it happens, then you have to cater for it. How does the feed inform you of such an event ? I trust it doesn't. So then your Song Identifier is still the Name.
and instead of a unique song record being created, the existing song_id is added to the artistssongs_table with the artist id, wouldn't this be a problem?
We don't know any better, so it is not a problem. No one watching that feed knows any better either. If and when you receive data informing you of that issue, through whatever channel, and you can specify it, you can change it.
Normally we have an app that allows us to navigate the hierarchies, and to change them, eg. a ReferenceMaintenance app, with an Explorer-type window on the left, and combo dialogues (list of occs on top, plus detail of one occ on the bottom) on the right.
Until then, it is not a form of corruption, because the constraint that prevents such corruption is undefined. You can't be held guilty of breaking a law that hasn't been written yet. Except in rogue states.
Although a song can have the same name, it doesn't necessarily mean it's the same record.
Yes.
Wouldn't it be better to differentiate a song by the artist?
They are differentiated by Artist.
You do appreciate that the fact of a Song, and the fact of an Artist playing a song, are two discrete facts, yes ? Please question any Predicates that do not make perfect sense, those are the propositions that the database supports.
Ave Maria exists as an independent fact, in Song
Karen Carpenter, Celine Dion, and Yours Truly exist as three independent facts, in Artist
Karen Carpenter-Ave Maria, Celine Dion-Ave Maria, and Yours Truly-Ave Maria exist as three discrete facts in ArtistSong.
That is seven separate facts, about one Song, about three Artists.
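The seven facts can be written down directly. This sketch uses Python's sqlite3 module and, as a simplifying assumption, uses the names themselves as keys; the table names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE artist (artist_name TEXT PRIMARY KEY);
CREATE TABLE song   (song_name   TEXT PRIMARY KEY);
-- The fact that a specific Artist recorded a specific Song.
CREATE TABLE artist_song (
    artist_name TEXT NOT NULL REFERENCES artist(artist_name),
    song_name   TEXT NOT NULL REFERENCES song(song_name),
    PRIMARY KEY (artist_name, song_name)
);
""")

artists = ["Karen Carpenter", "Celine Dion", "Yours Truly"]
conn.execute("INSERT INTO song VALUES ('Ave Maria')")                     # 1 fact
conn.executemany("INSERT INTO artist VALUES (?)", [(a,) for a in artists])  # 3 facts
conn.executemany(
    "INSERT INTO artist_song VALUES (?, 'Ave Maria')", [(a,) for a in artists]
)                                                                         # 3 facts

fact_count = sum(
    conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("song", "artist", "artist_song")
)
```

Each row is one fact: the song exists once, each artist exists once, and each recording is its own row that depends on both.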
Response to Comments 4
I do understand it now. The artistsong_table is where the 2 items "meet" and a relationship actually exists and is unique.
Yes. I just wouldn't state it in that way. The term Fact has a technically precise meaning, over and above the English meaning.
A database is a collection of facts about the real world, limited to the subject area of concern.
Perhaps read my Response 3 again, with that understanding of Fact in mind.
Each ArtistSong row is a Fact. That depends on the Fact of an Artist, and the Fact of a Song. It establishes the Fact that that Artist recorded that Song. And that ArtistSong Fact is one that other Facts, lower in the hierarchy, will depend upon.
"Relationship ... actually". I think you mean "instance". The relationship exists between the tables, because I drew a line, and you will implement a Foreign Key Constraint. Perhaps think of Fact as an "instance".
Just to make sure I understand the idea correctly, if I were to add "Genre" into the mix, would I be correct in thinking that a new 'independent' table genre_table would be created and the artistsong_table would inherit its PK as an FK?
Yes. It is a classic Reference or Lookup table, the Relationship will be Non-identifying. I don't know enough about the music brothelry to make any declarations, but as I understand it, Genre applies to a Song; an Artist; and an ArtistSong (they can play a Song in a Genre that is different to the Song.Genre). You have given me one, so I will model that.
The consequence of that is, when you are inserting rows in ArtistSong, you will have to have the Genre. If that is in the feed, well and good, if not, you have a processing issue to deal with. The simple method to overcome that is, implement a Genre "", which indicates to you that you need to determine it from other channels.
It is easy enough to add a classifier (eg. Genre) later, because it is a Non-identifying Relationship. But Identifying items are difficult to add later, because they force the Keys to change. Refer para 3 under my Response 1.
You are probably ready for a Data Model:
Third Draft Music Chart Data Model
It all depends on the relationships (one-to-one, one-to-many, many-to-many) your data is going to have.
The way you implemented your charts table indicates that:
Each chart has only/belongs to one song
A song can have many charts
It is a one-to-many relationship. And if that was what you intended then everything seems fine.
However:
If your charts can have many songs and a song will have only one chart (also a one-to-many relationship but reversed), the song_id column needs to be taken out of the charts table and the songs table needs a chart_id column added.
If your charts can have many songs and your songs can have many charts as well (many-to-many relationship), then you need a "join table" which could be something like this:
TABLE: charts_songs, COLUMNS: id, chart_id, song_id, position

Do I need to use state pattern for data approval process?

Users of our system are able to submit un-validated contact data. For example:
Forename: null
Surname: 231
TelephoneNumber: not sure
etc
This data is stored in a PendingContacts table.
I have another table - ApprovedContacts. This table has a variety of constraints to improve consistency and integrity. This table shouldn't contain any dirty or incomplete data.
I need a process to move data from one table to another. Structure of both tables is nearly identical, however, one table has the constraints, when another doesn't.
I have two states, Pending and Approved, and gut feeling tells me that I should use a state pattern (details here). In theory this should allow me to change a contact's state from Pending to Approved, depending on whether the contact has been successfully validated. The problem is that I don't see how this is going to work.
Am I going in the right direction or should I be looking at something completely different?
Presentation layer is in MVC 3, so I have view models for pending contacts and approved contacts, as well as domain models for pending contacts and approved contacts. My view models are generally DTOs with some validation routines, but now my view models represent state too. This doesn't seem right.
For example, all contacts must have a state and they can be saved and removed, so I need an interface for that:
public interface IContactViewModelState
{
    void Save(ContactViewModel item);
}
I then add an implementation for saving pending contacts into the PendingContacts table:
public class PendingContactViewModelState : IContactViewModelState
{
    public void Save(ContactViewModel item)
    {
        // Save to the pending contacts table
        // I don't like this as my view model now references data access layer
    }
}
Short answer: no, because you only have two states. You'd use a state pattern to help deal with complex situations with many states and rules. The only reason you might want to go with a full-blown state pattern based implementation is if there's a very high chance such a situation is imminent.
If the result of a successful transition to Approved is the record ending up in the approved table, then you really just need to decide where you want to enforce the constraints. This decision will/can be based on many factors, including the likely frequency of change (to the constraints) and where other logic resides.
A lot of patterns (but not all) tend to deal with how to structure an application, but here I think it's just a case of deciding where and how to implement some logic. In other words - you might just be accidentally over-analyzing the problem - it's easily done :)
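To show how little machinery two states need, here is a sketch in Python (not the question's C#/MVC stack). The validation rules, field names, and sample contacts are assumptions for illustration, and plain lists stand in for the PendingContacts and ApprovedContacts tables.

```python
def validate_contact(contact):
    """Return a list of problems; an empty list means the contact can be approved."""
    problems = []
    for field in ("forename", "surname"):
        value = contact.get(field)
        if not value or not value.replace("-", "").replace(" ", "").isalpha():
            problems.append(f"{field} is missing or not a name")
    phone = contact.get("telephone_number") or ""
    digits = phone.replace(" ", "").lstrip("+")
    if not digits.isdigit():
        problems.append("telephone_number is not numeric")
    return problems


def approve(contact, pending, approved):
    """Move a contact from pending to approved only if it validates."""
    problems = validate_contact(contact)
    if problems:
        return problems  # stays pending; caller can show the problems
    pending.remove(contact)
    approved.append(contact)
    return []


# Hypothetical sample data: the first contact mirrors the dirty example in the question.
dirty = {"forename": None, "surname": "231", "telephone_number": "not sure"}
clean = {"forename": "Ann", "surname": "Lee", "telephone_number": "+44 20 7946 0000"}
pending, approved = [dirty, clean], []

dirty_problems = approve(dirty, pending, approved)
clean_problems = approve(clean, pending, approved)
```

The single transition is one function, and the constraints live in one place (validate_contact) that can sit in the domain layer rather than in a view model.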
