Multiple sources for dimensions in Data Warehouse - dimensional-modeling

I am currently working on a financial Risk data warehouse. For my collateral dimension, I am sourcing the data from one source system. However, after further research by the business analyst, we found a legacy application that also holds collateral information which the bank also needs in the data warehouse. Bar a few common attributes that both source systems share, the legacy application contains a lot more attributes than are already defined in my current collateral dimension. What is therefore the best way to onboard this new information in the warehouse? I was thinking of extending the current collateral dimension, but then I would need to do this every time I find a new source, which is very likely given the size of the bank. Alternatively, is it better to create a new dimension called dimCollateralAdditionalInfo and add the extra attributes there?

A DWH model evolves over time, since new business requirements keep appearing. The most important thing is to check whether the new attributes are worth adding, that is, whether they represent an analytic axis.
You can store all the information in dimCollateral, in which case you need to manage this dimension carefully in terms of optimization (indexes, data types...).
Or you can create an extension dimension, dimCollateralExtension, containing the additional info, with a one-to-one relationship to the master dimension dimCollateral.
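The extension-dimension option can be sketched with an in-memory SQLite database from Python. This is only an illustration under assumptions: the column names (collateral_id, legacy_valuation_method, legacy_custodian) are hypothetical, and a real warehouse would use its own DDL. The extension table reuses the master's surrogate key as its primary key, which is what keeps the relationship one-to-one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimCollateral (
    collateral_key  INTEGER PRIMARY KEY,  -- surrogate key
    collateral_id   TEXT NOT NULL,        -- business key (hypothetical name)
    collateral_type TEXT
);
-- The extension shares the master's surrogate key as its own primary key,
-- so each master row can have at most one extension row.
CREATE TABLE dimCollateralExtension (
    collateral_key          INTEGER PRIMARY KEY
        REFERENCES dimCollateral (collateral_key),
    legacy_valuation_method TEXT,         -- hypothetical legacy attribute
    legacy_custodian        TEXT          -- hypothetical legacy attribute
);
""")
conn.execute("INSERT INTO dimCollateral VALUES (1, 'C-001', 'Property')")
conn.execute("INSERT INTO dimCollateral VALUES (2, 'C-002', 'Bond')")
conn.execute("INSERT INTO dimCollateralExtension VALUES (1, 'Appraisal', 'Branch A')")

# A LEFT JOIN keeps master rows that have no legacy attributes yet.
rows = conn.execute("""
    SELECT d.collateral_id, d.collateral_type, e.legacy_valuation_method
    FROM dimCollateral d
    LEFT JOIN dimCollateralExtension e ON e.collateral_key = d.collateral_key
    ORDER BY d.collateral_key
""").fetchall()
print(rows)
```

Facts keep pointing at dimCollateral only; queries that need the legacy attributes join through the shared key.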

Related

Seeking Advice For Oracle Data-Intensive Application

I'm endeavoring to develop an application that uses Oracle as the database back-end. The application will calculate several statistics from the various tables in the database. The front-end will most likely be a web application, and this front-end will display various charts and calculated statistics. Now, I imagine that it would be more efficient to perform the calculations in the database rather than in the service layer, because otherwise said calculations would need to be performed for every web request. That being the case, I'm not sure which mechanism to use (e.g. stored procedure, function, view). To illustrate what I'm going for, suppose I want to keep statistics of student grades for many students. I would like to have a web interface that lets me view those statistics on a student-by-student basis and also on an all-inclusive basis. Some of the stats are dependent on aggregates (e.g. average, min, max) of all of the student grades and some stats are dependent only on an individual student. In this situation, every time a record is added or updated, the aggregates would have to be recalculated. So I am speculating that if I had a special table that held all of the calculated values I need, and one or more triggers to recalculate everything when a record is added or updated, then all I would need to do from a web request point of view is have the service layer pull the desired values from this special table. I'm just not sure if this is the best way to go or not, so I am asking the community for any input/advice. Note: Although I'm using Oracle, I'm open to using PostgreSQL or MySQL.
Thanks in advance
The scenario you are describing would be ideal for using materialized views. They can be designed to refresh automatically (and incrementally) every time the source data is updated by your application. The calculations would be built in to the view definition. No triggers required, and likely no stored procedures unless your calculations involve multiple steps. Check here: https://oracle-base.com/articles/misc/materialized-views and here: https://medium.com/oracledevs/lightning-fast-sql-with-real-time-materialized-views-12-things-developers-will-love-about-oracle-54bcc9eac358 for more info.

Design a dimension with multiple data sources

I am designing a few dimensions with multiple data sources and wonder what other people have done to align the multiple business keys per data source.
My Example:
I have 2 data sources - the Ordering System and the Execution System. The Ordering System has details about payment and what should happen; the Execution System has details on what actually happened (how long it took, who executed the order, etc.). Data from both systems is needed to create a single fact.
In both the Ordering and Execution systems there is a Location table. The business keys from both systems are mapped via an ESB. There are attributes in both systems that make up the complete picture of a single location. Billing information is in the Ordering System, latitude and longitude are in the Execution System, and Location Name exists in both systems.
How do you design an SCD to accommodate changes from both systems to the dimension?
We follow a fairly strict Kimball methodology - fyi, but I am open to looking at everyone's solutions.
Not necessarily an answer but here are my thoughts:
You've already covered the real options in your comment. Either:
A. Merge it beforehand
You need some merge functionality in staging which matches the two (or more) records, creates a new common merge key and uses that in the dimension. This requires some form of lookup or reference table to be stored in addition to the normal DW data.
OR
B. Merge it in the dimension
Put both records in the dimension and allow the reporting tool to 'merge' them by, for example, grouping by location name. This means you don't need prior merging logic; you just load both records into the dimension.
However, you have two constraints that I feel make the choice between A & B clearer.
Firstly, you need an SCD (Type 2, I assume). This means Option B could get very complicated: when there is a change in one source record you have to go and find the other record and change it as well, which is very unpleasant. You still need some kind of pre-stored key to link them, which means option B is no longer simple.
Secondly, given that you have two sources for one attribute (Location Name), you need some kind of staging logic to pick a single name when these don't match.
So given these two circumstances, I suggest that option A would be best - build some pre-merging logic, as the complexity of your requirements warrants it.
You'd think this would be a common issue but I've never found a good online reference explaining how someone solved this before.
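As a rough sketch of option A, the staging merge could look like the Python below. Everything specific here is an assumption made for illustration: the KEY_MAP contents stand in for the ESB mapping, the attribute names are invented, and the rule that one system's Location Name wins is just one possible tie-break.

```python
# Hypothetical stand-in for the ESB mapping:
# (source system, source business key) -> common merge key
KEY_MAP = {
    ("ordering", "ORD-17"): "LOC-1",
    ("execution", "EXE-903"): "LOC-1",
}

ordering_rows = [{"key": "ORD-17", "name": "Main St Depot", "billing_code": "B-44"}]
execution_rows = [{"key": "EXE-903", "name": "Main Street Depot", "lat": 51.5, "lon": -0.12}]


def merge_locations(ordering_rows, execution_rows, key_map, name_master="execution"):
    """Merge rows from both systems into one record per common merge key.

    Attributes owned by a single system are copied as-is. For the shared
    attribute "name", the value from `name_master` wins; the other system's
    name is used only when the master has not supplied one.
    """
    merged = {}
    for source, rows in (("ordering", ordering_rows), ("execution", execution_rows)):
        for row in rows:
            merge_key = key_map[(source, row["key"])]
            target = merged.setdefault(merge_key, {"merge_key": merge_key})
            for attr, value in row.items():
                if attr == "key":
                    continue  # source keys stay in the key map, not in the dimension row
                if attr == "name" and source != name_master and "name" in target:
                    continue  # a name is already present; do not overwrite it
                target[attr] = value
    return merged


merged = merge_locations(ordering_rows, execution_rows, KEY_MAP)
print(merged["LOC-1"])
```

The merged record is then the single input to the Type 2 change detection, so a change arriving from either system produces one new dimension version rather than two records that have to be kept in step.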
My thinking is actually very simple. First you need to decide which system is your master dataset for geo-location data, and at what granularity.
My method would be:
DIM loading
Say below is my target
Dim_Location = {Business_key, Longitude, Latitude, Location Name}
Dictionary
Business_key = Always maps to the master record from the source system (in this case the Execution System). Assume for now that the unique business key for this table is the combination (longitude, latitude).
Location Name = Again, since we assume the "Execution System" is the master for our data, this is taken from Source="Execution System".
The above table is now loaded for Fact lookup.
Fact Loading
You have already integrated the records between the execution system and the billing system. It's a straightforward lookup and load in staging, since the record exists with the necessary geo_location combination.
Challenging scenarios
What if the execution system has late-arriving records for orders?
What if the same geo_location points to multiple location names? This should not happen, but it is worth profiling the data for such errors.
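The profiling check in the last scenario can be sketched in a few lines of Python. The field names (longitude, latitude, location_name) and the sample rows are assumptions, and coordinates are compared for exact equality, which real data may need to relax with rounding.

```python
from collections import defaultdict


def conflicting_names(rows):
    """Return the coordinates that map to more than one distinct location name."""
    names_by_point = defaultdict(set)
    for row in rows:
        point = (row["longitude"], row["latitude"])
        # Normalise case and whitespace so trivial differences are not reported.
        names_by_point[point].add(row["location_name"].strip().lower())
    return {
        point: sorted(names)
        for point, names in names_by_point.items()
        if len(names) > 1
    }


sample = [
    {"longitude": -0.12, "latitude": 51.5, "location_name": "Main Street Depot"},
    {"longitude": -0.12, "latitude": 51.5, "location_name": "main street depot "},
    {"longitude": 2.35, "latitude": 48.86, "location_name": "Paris Hub"},
    {"longitude": 2.35, "latitude": 48.86, "location_name": "Paris Warehouse"},
]
conflicts = conflicting_names(sample)
print(conflicts)
```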

Handling passive deletion updates (ie. archiving instead of deleting)

We are developing an application based on DDD principles. We have encountered a couple of problems so far that we can't answer nor can we find the answers on the Internet.
Our application is intended to be a cloud application for multiple companies.
One of the demands is that there are no physical deletions from the database. We only do passive deletion, by setting the Active property of an entity to false. That takes care of Select, Insert and Delete operations, but we don't know how to handle update operations.
An update changes the values of properties, but it also means that the past values are lost, and there are many reasons why we don't want that. One of the primary reasons is accounting.
If we implement every update as "archive old values" followed by "create new values", we would have a great number of duplicate values. For example, Company has Branches, and Company is the Aggregate Root for Branches. If I change a Company's phone number, I would have to archive the old company and all of its branches and create a completely new company with branches, just for one property. This may look like a good idea at first, but over time there will be many values which can clog up the database. Phone may be an irrelevant property, but changing the Address (if the street name has changed but the company is still in the same physical location) is a far more serious problem.
Currently we are using ASP.NET MVC with EF CF for the repository, but one of the demands is that we are able to easily switch to, or add, another technology like WPF or WCF. We are using Automapper to map DTOs to Domain entities and vice versa, and DTOs are the primary source for views, i.e. we have no view models. The application is layered according to DDD principles, and mapping occurs in the Service Layer.
Another demand is that we mustn't create an initial entity in the database and then fill in the values; an entire aggregate should be stored as a whole.
Any comments or suggestions are appreciated.
We also welcome any changes in demands (as this is an internal project, and not for a customer) and architecture, but only if it's absolutely necessary.
Thank you.
Have you ever come across event sourcing? Sounds like it could be of use if you're interested in tracking the complete history of aggregates.
To be honest, I would create another table that acts as a change log, and insert the old version of each updated or deleted record into it before updating the live data. Yes, you are creating a lot of records, but you are separating this data from the live records and keeping the live data as lean as possible.
Also, when it comes to clean-up and backup, you have your live data and your changed/deleted data separate, so you can routinely back up and trim the old changed/deleted data and reduce its size, depending on how long you have agreed with the supplier or business you are working with to keep it available.
I think this would be the best way to go, as your core functionality will be working on a leaner dataset, and I'm assuming your users won't want to check revisions and deletions of records all the time. By separating the data, you access it only when it is needed, instead of all the time because everything is intermingled.
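A minimal sketch of the change-log idea, using SQLite from Python. The table and column names are assumptions, and using a database trigger to copy the old row is a choice made for this illustration; the same copy could equally be done in the repository or service layer before the update is issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    phone  TEXT,
    active INTEGER NOT NULL DEFAULT 1   -- passive deletion flag
);
CREATE TABLE company_change_log (
    log_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    company_id INTEGER NOT NULL,
    name       TEXT,
    phone      TEXT,
    active     INTEGER,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Copy the old values into the log before every update of a live row.
CREATE TRIGGER company_before_update
BEFORE UPDATE ON company
BEGIN
    INSERT INTO company_change_log (company_id, name, phone, active)
    VALUES (OLD.id, OLD.name, OLD.phone, OLD.active);
END;
""")

conn.execute("INSERT INTO company (id, name, phone) VALUES (1, 'Acme', '555-0100')")
conn.execute("UPDATE company SET phone = '555-0199' WHERE id = 1")  # ordinary update
conn.execute("UPDATE company SET active = 0 WHERE id = 1")          # passive deletion

live = conn.execute("SELECT name, phone, active FROM company").fetchall()
history = conn.execute(
    "SELECT name, phone, active FROM company_change_log ORDER BY log_id"
).fetchall()
print(live)
print(history)
```

Only the changed row is logged, so changing a company's phone number does not require archiving its branches as well.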

Database design: Same table structure but different table

My latest project deals with a lot of "staging" data.
Like when a customer registers, the data is stored in "customer_temp" table, and when he is verified, the data is moved to "customer" table.
Before I start shooting e-mails, go on a rampage on how I think this is wrong and you should just put a flag on the row, there is always a chance that I'm the idiot.
Can anybody explain to me why this is desirable?
Creating 2 tables with the same structure, populating a table (table 1), then moving the whole row to a different table (table 2) when certain events occur.
I can understand it if table 2 stores archival, seldom-used data.
But I can't understand it if table 2 stores live data that can change constantly.
To recap:
Can anyone explain how wrong (or right) this seemingly counter-productive approach is?
If there is a significant difference between a "customer" and a "potential customer" in the business logic, separating them out in the database can make sense (you don't need to always remember to query by the flag, for example). In particular if the data stored for the two may diverge in the future.
It makes reporting somewhat easier and reduces the chances of treating both types of entities as the same one.
As you say, however, this does look redundant and would probably not be the way most people design the database.
There seem to be several possible explanations for why you would want "customer_temp".
As you noted, it could be for archival purposes, to allow analyzing the data; but in that case the historical data should be aggregated according to some interesting query, and using it for live data does not sound plausible.
As Oded noted, there could be certain business logic that differentiates between a customer and a potential customer.
Or it could be a security feature which requires logging all attempts to register a customer in addition to storing approved customers.
Any time I see a permanent table named "customer_temp" I see a red flag. This typically means that someone was working through a problem as they went along and didn't think ahead about it.
As for the structure you describe, there are some advantages. For example, the tables could be indexed differently or placed on different file locations for performance.
But typically these advantages aren't worth the cost of keeping the structures in sync when things change (adding a column to both tables, searching for two sets of dependencies, etc.).
If you really need them to be treated differently, then it's better to handle that by adding a layer of abstraction with a view rather than creating two separate models.
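The view-based alternative might look like the following SQLite sketch driven from Python. The names customer_all, customer_pending and verified are hypothetical; the point is only that one physical table carries a flag and each kind of customer gets its own view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One physical table; the flag separates verified from unverified rows.
CREATE TABLE customer_all (
    id       INTEGER PRIMARY KEY,
    email    TEXT NOT NULL,
    verified INTEGER NOT NULL DEFAULT 0
);
-- Views give each kind of customer its own name, so queries
-- do not have to remember to filter on the flag.
CREATE VIEW customer AS
    SELECT id, email FROM customer_all WHERE verified = 1;
CREATE VIEW customer_pending AS
    SELECT id, email FROM customer_all WHERE verified = 0;
""")

conn.execute("INSERT INTO customer_all (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO customer_all (id, email) VALUES (2, 'b@example.com')")

# Verification flips the flag instead of moving the row to another table.
conn.execute("UPDATE customer_all SET verified = 1 WHERE id = 1")

verified = conn.execute("SELECT id, email FROM customer").fetchall()
pending = conn.execute("SELECT id, email FROM customer_pending").fetchall()
print(verified, pending)
```

A schema change such as adding a column is then made once, on the base table, instead of on two tables that must be kept identical.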
I would have used a single table design, as you suggest. But I only know what you posted about the case. Before deciding that the designer was an idiot, I would want to know what other consequences, intended or unintended, may have followed from the two table design.
For example, it may reduce contention between processes that are storing new potential customers and processes accessing the existing customer base. Or it may permit certain columns to be constrained to be not null in the customer table that are permitted to be null in the potential customer table. Or it may permit write access to the customer table to be tightly controlled, and unavailable to operations that originate from the web.
Or the original designer may simply not have seen the benefits you and I see in a single table design.

Strategy for updating data in databases (Oracle)

We have a product using Oracle, with about 5000 objects in the database (tables and packages). The product is divided into two parts. The first is the hard part: client, packages and database schema. The second consists mainly of soft data representing processes (workflows) that can be configured to run on our product.
The basic processes (workflows) are delivered as part of the product, and our customers can change these processes and adapt them to their needs. The problem arises when upgrading to a newer version of the product: when we try to update the data records in the database, we run into problems with records that our customers have deleted or modified.
Is there a strategy to handle this problem?
It is common for a software product to be comprised of not just client and schema objects, but data as well; typically it seems to be called "static data", i.e. it is data that should only be modified by the software developer, and is usually not modifiable by end users.
If the end users bypass your security controls and modify/delete the static data, then you need to either:
1. write code that detects, and compensates for, any modifications the end user may have done; e.g. wipe the tables and repopulate with "known good" data;
2. get samples of modifications from your customers so you can hand-code customised update scripts for them, without affecting their customisations; or
3. don't allow modifications of static data (i.e. if they customise the product by changing data they shouldn't, you say "sorry, you modified the product, we don't support you").
From your description, however, it looks like your product is designed to allow customers to customise it by changing data in these tables; in which case, your code just needs to be able to adapt to whatever changes they may have made. That needs to be a fundamental consideration in the design of the upgrade. The strategy is to enumerate all the types of changes that users may have made (or are likely to have made), and cater for them. The only viable alternative is #1 above, which removes all customisations.
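One way to start enumerating the types of changes is to compare the records as they were shipped with what is currently installed at the customer. The Python sketch below assumes the vendor still has the shipped baseline for the old version and that each workflow record can be identified by a stable key; both are assumptions, as are the sample records.

```python
def classify_changes(shipped, installed):
    """Classify each record by comparing the shipped baseline with the customer's data.

    Both arguments map a record key to the record's content.
    """
    changes = {}
    for key, original in shipped.items():
        if key not in installed:
            changes[key] = "deleted"      # customer removed a shipped record
        elif installed[key] != original:
            changes[key] = "modified"     # customer edited a shipped record
        else:
            changes[key] = "unchanged"    # safe to overwrite with the new version
    for key in installed:
        if key not in shipped:
            changes[key] = "added"        # customer-created record, leave alone
    return changes


# Hypothetical workflow records keyed by process name.
shipped = {"approve": {"steps": 3}, "reject": {"steps": 2}, "escalate": {"steps": 4}}
installed = {"approve": {"steps": 3}, "reject": {"steps": 5}, "custom": {"steps": 1}}

changes = classify_changes(shipped, installed)
print(changes)
```

The upgrade script can then apply a separate rule to each category, for example overwriting unchanged records and leaving modified ones for manual review.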
