Fact table organization - ETL

I am participating in the creation of reporting software that uses the Kimball star schema methodology. The entire team (including me) hasn't worked with this technology before, so we are all new to it.
There are a couple of dimension and fact tables in our system so far. For example:
- DIM_Customer (dimension table for customers)
- DIM_BusinessUnit (dimension table for business units)
- FT_Transaction (fact table, granularity per transaction)
- FT_Customer (fact table for customers; customer ID and as-on date form a composite PK)
This is the current structure of FT_Customer:
- customer_id # (customer id, part of composite PK)
- as_on_date # (date of observation, part of composite PK)
- waic (KPI)
- wat (KPI)
- waddl (KPI)
- wadtp (KPI)
- aging_bucket_current (KPI)
- aging_bucket_1_to_10 (KPI)
- aging_bucket_11_to_25 (KPI)
- ... ...
The fields waic, wat, waddl and wadtp are related to delays in transaction payment. They are calculated by an aggregation query against the FT_Transaction table, grouped by customer_id and as_on_date.
The fields aging_bucket_current, aging_bucket_1_to_10 and aging_bucket_11_to_25 contain the number of transactions categorized by payment delay. For example, aging_bucket_current holds the number of transactions that were paid on time, aging_bucket_1_to_10 the number paid with a delay of 1 to 10 days, and so on.
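To illustrate, the aggregation looks roughly like this (delay_days and amount are stand-ins for the real FT_Transaction columns, and the wat formula shown is only illustrative; the other KPIs are computed analogously):
-- Rough sketch of the KPI aggregation against FT_Transaction
INSERT INTO FT_Customer (customer_id, as_on_date, wat,
                         aging_bucket_current, aging_bucket_1_to_10, aging_bucket_11_to_25)
SELECT t.customer_id,
       t.as_on_date,
       SUM(t.delay_days * t.amount) / NULLIF(SUM(t.amount), 0)         AS wat,
       SUM(CASE WHEN t.delay_days <= 0 THEN 1 ELSE 0 END)              AS aging_bucket_current,
       SUM(CASE WHEN t.delay_days BETWEEN 1  AND 10 THEN 1 ELSE 0 END) AS aging_bucket_1_to_10,
       SUM(CASE WHEN t.delay_days BETWEEN 11 AND 25 THEN 1 ELSE 0 END) AS aging_bucket_11_to_25
FROM FT_Transaction t
GROUP BY t.customer_id, t.as_on_date;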
This structure is used for report generation from a PHP web application as well as from Cognos Studio. We have discussed restructuring the FT_Customer table to make it more usable for external systems like Cognos.
New proposed structure of FT_Customer:
- customer_id # (customer id, part of composite PK)
- as_on_date # (date of observation, part of composite PK)
- kpi_id # (id of KPI, foreign key that points to DIM_KPI dimension table, part of composite PK)
- kpi_value (value of the KPI)
- ... ...
For this proposal we will have additional dimension table DIM_KPI:
- kpi_id #
- title
This table will contain all KPIs (wat, waic, waddl, aging buckets ...).
The second structure of FT_Customer will obviously have more rows than the current one.
Which structure of FT_Customer is more universal?
Is it acceptable to keep both structures in separate tables? This would obviously put an additional burden on the ETL layer, because some of the work would be done twice, but on the other hand it would make generating various reports easier.
Thanks in advance for suggestions.

The 1st structure seems to be more natural and common to me. However, the 2nd one is more flexible, because it supports adding new KPIs without changing the structure of the fact table.
If different ways of accessing data actually require different structures, there is nothing wrong about having two fact tables with the same data, as long as:
- both tables are always loaded together (not necessarily in parallel, but within the same data load job/workflow),
- measure calculations are consistent (reuse the logic if possible).
You should test the results for any data inconsistencies.
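For example, the KPI-per-row table can be derived directly from the wide one, so the calculation logic lives in only one place. A rough sketch (the target name FT_Customer_KPI and the hard-coded kpi_id values are assumptions, not names from your model):
-- Derive the narrow structure from the wide one so KPI logic is not duplicated
INSERT INTO FT_Customer_KPI (customer_id, as_on_date, kpi_id, kpi_value)
SELECT customer_id, as_on_date, 1, wat                  FROM FT_Customer
UNION ALL
SELECT customer_id, as_on_date, 2, waic                 FROM FT_Customer
UNION ALL
SELECT customer_id, as_on_date, 3, aging_bucket_current FROM FT_Customer;
-- ...one SELECT per KPI column, with the matching kpi_id from DIM_KPI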

Before you proceed, go buy yourself Agile Data Warehouse Design and read it thoroughly. It's pretty cheap.
http://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203
Your fact tables are for processes or events that you want to analyze. You should name them noun_verb_noun (for example, customers_order_items). If you can't come up with a name like that, you probably don't have a fact table. What is your Customer fact table for? Customer is usually a dimension table.
The purpose of your data warehouse is to facilitate analysis. Use longer column names (with _ as word separator). Make life easy on your analysts.

Related

CDC strategy for multiple staging tables

I'm implementing a Data Mart following the Kimball methodology and I have a challenge with applying deltas from multiple source tables against a single target dimension.
Here's an example of the incoming source data:
STG_APPLICATION
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, ...
1, FOOBAR, 20/10/2018, MD5_XXX
STG_APPLICATION_STATUS
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, ...
1, SUBMITTED, "APP WAS SUBMITTED", MD5_YYY
Each of these tables (there are several others) represents a normalised version of the source data, i.e. a single application can have one or more statuses associated with it.
Now then, because we only get a full alpha for these tables, we have to do a snapshot merge, i.e. apply a full outer join of the current day's set of records against the previous day's set of records for each individual table. This is computed by comparing the CDC_HASH (a concatenation of all source columns). The result of this comparison is stored in a delta table as follows:
STG_APPLICATION_DELTA
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, CDC_STATUS ...
STG_APPLICATION_STATUS_DELTA
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, CDC_STATUS...
1, AWARDED, "APP WAS AWARDED", MD5_YYY, NEW
So in this example, the first table, STG_APPLICATION did not generate a delta record as the attributes pertaining to that table did not change between daily loads. However, the associated table, STG_APPLICATION_STATUS, did calculate a delta, i.e. one or more fields have changed since the last load. This is highlighted by the CDC_STATUS which identifies it as a new record to insert.
The problem now of course is how to correctly handle this situation when loading the target dimension? For example:
DIM_APPLICATION
ID, APPLICATION_ID, APP_NAME, APP_START_DATE, APP_STATUS_CODE, FROM_DATE, TO_DATE
1, 1, FOOBAR, 20/10/2018, SUBMITTED, 20/10/2018, 12/04/2019
2, 1, NULL, NULL, AWARDED, 13/04/2019, 99/99/9999
This shows the first record - based on these two staging tables being joined - and the second record which is meant to reflect an updated version of the record. However, as previously illustrated, my Delta tables are only partially populated, and therefore I am unable to correctly update the dimension as shown here.
Logically, I understand that I need to be able to include all fields that are used by the dimension as part of my delta calculation, so that I have a copy of a full record when updating the dimension, but I'm not sure of the best way to implement this in my staging area. As shown already, I currently only have independent staging tables, each of which calculate their delta separately.
Please can somebody advise on the best way to handle this? I've scrutinized Kimball's books on this but to no avail, and I've found no suitable answer on any online forum either. This is a common problem, so I'm sure there is a suitable architectural pattern to resolve it.
You will need to either compare on joined records or lookup the current dimension values.
If the amount of (unchanged) data is not excessive, you could join the full snapshots of STG_APPLICATION and STG_APPLICATION_STATUS together on APP_ID until they resemble the dimension record column-wise and store those in a separate table with their CDC hash to use as previous day. You then take the deltas at this level and send the (complete) changed records as updates to the dimension.
If the amount of records in the daily update makes it impractical to join the full tables, you can take the deltas and full outer join them as you do now. Then you look up the current dimension record for this APP_ID and fill in all empty fields in the delta record. The completed record is then sent as an update to the dimension.
This solution requires less storage but seems more fragile, especially if multiple changes are possible within a day. If there are many changes, performance may also suffer. For a handful of changes in millions of records, it should be more efficient.
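A rough sketch of the first approach, with illustrative names (MD5() and the '|' separator stand in for whatever hashing convention your existing CDC process uses):
-- Join the full daily snapshots so the staging grain matches the dimension,
-- then hash the combined record.
CREATE TABLE STG_APPLICATION_COMBINED AS
SELECT a.APP_ID,
       a.APP_NAME,
       a.APP_START_DATE,
       s.STATUS_CODE,
       s.STATUS_DESC,
       MD5(a.APP_NAME || '|' || a.APP_START_DATE || '|'
           || s.STATUS_CODE || '|' || s.STATUS_DESC) AS CDC_HASH
FROM STG_APPLICATION a
JOIN STG_APPLICATION_STATUS s ON s.APP_ID = a.APP_ID;
-- The existing snapshot merge (full outer join against yesterday's copy of this
-- table on APP_ID, comparing CDC_HASH) then yields complete changed records
-- that can be applied to DIM_APPLICATION as type-2 updates.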

Oracle database help optimizing LIKE searches

I am on Oracle 11g and we have these 3 core tables:
Customer - CUSTOMERID|DOB
CustomerName - CUSTOMERNAMEID|CustomerID|FNAME|LNAME
Address - ADDRESSID|CUSTOMERID|STREET|CITY|STATE|POSTALCODE
I have about 60 million rows on each of the tables and the data is a mix of US and Canadian population.
I have a front-end application that calls a web service and they do a last name and partial zip search. So my query basically has
where CUSTOMERNAME.LNAME = ? and ADDRESS.POSTALCODE LIKE '?%'
They typically provide the first 3 digits of the zip.
The address table has an index on all street/city/state/zip and another one on state and zip.
I did try adding an index exclusively for the zip and forced Oracle to use that index in my query, but that didn't make any difference.
For returning about 100 rows (I have pagination to only return 100 at a time) it takes about 30 seconds which isn't ideal. What can I do to make this better?
The problem is that the filters you are applying are not very selective and they apply to different tables. This is bad for an old-fashioned b-tree index. If the content is very static, you could try bitmap indexes; more precisely, a function-based bitmap join index on the first three letters of the last name and a bitmap join index on the postal code column. This assumes that very few people whose last name starts with certain letters live in an area with a certain postal code.
CREATE BITMAP INDEX ix_customer_custname
  ON customer (SUBSTR(customername.lname, 1, 3))
  FROM customer, customername
  WHERE customer.customerid = customername.customerid;
CREATE BITMAP INDEX ix_customer_postalcode
  ON customer (SUBSTR(address.postalcode, 1, 3))
  FROM customer, address
  WHERE customer.customerid = address.customerid;
If this works, you should see the two bitmap indexes being combined with a BITMAP AND in the execution plan. The execution time should drop to a couple of seconds, although it will not be as fast as a b-tree index.
Remarks:
You may have to experiment a bit to see whether it is more efficient to create one or two indexes, and whether the functions actually help.
If you make the indexes function-based, you must include the exact same function calls in the WHERE clause of your query; otherwise the index will not be used.
DML operations will be considerably slower, so this is only useful for tables with fairly static data. Note that DML operations will lock whole ranges of rows, and concurrent DML operations will run into problems.
Response time will probably still be a few seconds, not near-instantaneous as with a b-tree index.
AFAIK this works only on Enterprise Edition. The syntax is untested because I do not have an Enterprise Edition database available at the moment.
If this is still not fast enough, you can create a materialized view with customerid, last name and postal code and put a b-tree index on it. But that is kind of expensive, too.
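A sketch of that materialized view plus index (names are illustrative, and like the statements above this is untested):
-- Materialized view narrowed to the searched columns, with a composite b-tree
-- index: equality on LNAME plus a LIKE 'xxx%' prefix on POSTALCODE can use it.
CREATE MATERIALIZED VIEW mv_cust_lname_zip
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND AS
SELECT c.customerid, cn.lname, a.postalcode
FROM   customer c
JOIN   customername cn ON cn.customerid = c.customerid
JOIN   address a       ON a.customerid  = c.customerid;
CREATE INDEX ix_mv_lname_zip ON mv_cust_lname_zip (lname, postalcode);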

How to put data in a fact table?

I'm new to business intelligence, and I have designed a star schema that implements a data mart to help analysts make decisions about student grades.
Dimension tables:
- module (module code, module name) that contains information about the module
- student (code, first_name, last_name, ...) that contains information about the student
- school subject ( code, name, professor name...)
- degree ( code, libelle)
- specialite (code, libelle)
- time(year,half year)
- geographie(continent,country,city)
Fact table:
- result ( score, module score, year score)
The data source is an Excel file:
each file has a set of sheets, and each sheet presents student scores for a given Level 'X', Specialite 'Y', Year and Half-Year 'Z', Module 'U', City 'A', ...
My question is:
how can I load the data from Excel into my dimension and fact tables?
For the dimensions I suppose it is easy, but I would still like your suggestions;
for the fact table I have no idea.
I'm sorry for my bad English.
Most basic answer: pick an ETL tool and start moving the data.
You will generally need to:
- Load your dimension tables first. The ID columns in these tables will link to the fact table.
- In your ETL package/routine that populates the fact table:
  - select the data to be placed in the fact table from the source/staging area,
  - do a lookup against each of the dimension tables to get the ID of each dimension value,
  - finally, do some duplicate detection to see if any of the rows are already in the fact table,
  - and insert the data.
This process will be broadly similar regardless of the ETL tool you use; see the sketch below. There are a few tutorials that go into some detail (use Google), but the basic technique is lookups to get the dimension keys.
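As a concrete illustration, assuming the Excel sheets are first loaded into a staging table (all table and column names below are made up, not taken from your model):
-- Fact load: dimension key lookups plus a duplicate check
INSERT INTO fact_result (student_id, module_id, subject_id, time_id, score)
SELECT st.student_id,
       m.module_id,
       sub.subject_id,
       t.time_id,
       src.score
FROM   stg_scores src                               -- rows loaded from Excel
JOIN   dim_student st  ON st.student_code  = src.student_code
JOIN   dim_module  m   ON m.module_code    = src.module_code
JOIN   dim_subject sub ON sub.subject_code = src.subject_code
JOIN   dim_time    t   ON t.year = src.year AND t.half_year = src.half_year
WHERE  NOT EXISTS (                                 -- duplicate detection
         SELECT 1 FROM fact_result f
         WHERE  f.student_id = st.student_id
         AND    f.module_id  = m.module_id
         AND    f.time_id    = t.time_id);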

Teradata: How to design table to be normalized with many foreign key columns?

I am designing a table in Teradata with about 30 columns. These columns need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It is bad design to store the actual string values in the table, since this would be an atrocious repetition of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, Weekly and would use Teradata's identity column to derive the primary key. This primary key would then be stored in the table I am creating as foreign keys.
This would work fine for my application since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with such a large number of foreign key fields and am fairly new to Teradata. Before I go down the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain group of values relating to time intervals as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID   Type         Value
---  -----------  -----------
1    Interval     Daily
2    Interval     Monthly
3    Interval     Weekly
4    TimeFrame    24x7
5    TimeFrame    8x5
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID   Type         Value
---  -----------  -----------
0    Unknown      Unknown
Then in the main table, instead of having fields as null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
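For example (main_table and its foreign key columns are made-up names), reporting queries can then use inner joins throughout:
SELECT t.some_measure,
       iv.value AS interval_name,
       tf.value AS time_frame
FROM   main_table t
JOIN   lookup_table iv ON iv.id = t.interval_id    -- 0 matches the Unknown row,
JOIN   lookup_table tf ON tf.id = t.timeframe_id;  -- so no outer join is needed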

Having more than 50 columns in a SQL table

I have designed my database in such a way that one of my tables contains 52 columns. All the attributes are tightly associated with the primary key attribute, so there is no scope for further normalization.
Please let me know: if the same kind of situation arises and you don't want to keep so many columns in a single table, what are the other options?
It is not odd in any way to have 50 columns. ERP systems often have 100+ columns in some tables.
One thing you could look into is to ensure that most columns have valid default values (null, today, etc.). That will simplify inserts.
Also ensure your code always specifies the columns explicitly (i.e. no "select *"). Any kind of future optimization will include indexes with a subset of the columns.
One approach we used once is to split your table into two tables. Both of these tables get the primary key of the original table. In the first table you put the most frequently used columns, and in the second table the less frequently used ones. Generally the first one should be smaller. You can now speed things up in the first table with various indexes; in our design, we even had the first table running on the memory engine (RAM), since we only had read queries. If you need a combination of columns from table1 and table2, you join both tables on the primary key, as sketched below.
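A minimal sketch of that split, with placeholder names:
-- Frequently used columns
CREATE TABLE big_table_hot (
    id     INTEGER PRIMARY KEY,
    col_a  VARCHAR(100),
    col_b  DATE
);
-- Less frequently used columns, sharing the same primary key
CREATE TABLE big_table_cold (
    id     INTEGER PRIMARY KEY REFERENCES big_table_hot (id),
    col_c  VARCHAR(100)
    -- ...remaining columns
);
-- When the full row is needed:
SELECT h.*, c.col_c
FROM   big_table_hot h
JOIN   big_table_cold c ON c.id = h.id;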
A table with fifty-two columns is not necessarily wrong. As others have pointed out many databases have such beasts. However I would not consider ERP systems as exemplars of good data design: in my experience they tend to be rather the opposite.
Anyway, moving on!
You say this:
"All the attributes are tightly associated with the primary key
attribute"
Which means that your table is in third normal form (or perhaps BCNF). That being the case it's not true that no further normalisation is possible. Perhaps you can go to fifth normal form?
Fifth normal form is about removing join dependencies. All your columns are dependent on the primary key, but there may also be dependencies between columns: e.g., there are multiple values of COL42 associated with each value of COL23. Such a dependency means that when we add a new value of COL23 we end up inserting several records, one for each value of COL42. The Wikipedia article on 5NF has a good worked example.
I admit not many people go as far as 5NF, and it might well be that even with fifty-two columns your table is already in 5NF. But it's worth checking, because if you can break out one or two subsidiary tables you'll have improved your data model and made your main table easier to work with.
Another option is the "item-result pair" (IRP) design over the "multi-column table" (MCT) design, especially if you'll be adding more columns from time to time.
MCT_TABLE
---------
KEY_col(s)
Col1
Col2
Col3
...
IRP_TABLE
---------
KEY_col(s)
ITEM
VALUE
select * from IRP_TABLE;
KEY_COL  ITEM  VALUE
-------  ----  -----
1        NAME  Joe
1        AGE   44
1        WGT   202
...
IRP is a bit harder to use, but much more flexible.
I've built very large systems using the IRP design and it can perform well even for massive data. In fact it behaves somewhat like a column-organized DB, as you only pull in the rows you need (i.e. less I/O) rather than an entire wide row when you only need a few columns (i.e. more I/O).
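When a query does need the wide shape back, the IRP rows can be pivoted with conditional aggregation, for example:
-- Rebuild a wide row from IRP_TABLE (one MAX(CASE ...) per item of interest)
SELECT key_col,
       MAX(CASE WHEN item = 'NAME' THEN value END) AS name,
       MAX(CASE WHEN item = 'AGE'  THEN value END) AS age,
       MAX(CASE WHEN item = 'WGT'  THEN value END) AS wgt
FROM   irp_table
GROUP BY key_col;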
