Teradata: How to design table to be normalized with many foreign key columns? - performance

I am designing a table in Teradata with about 30 columns. These columns are going to need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It is bad design to store the actual string values in the table since this would be an atrocious repeat of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, Weekly and would use Teradata's identity column to derive the primary key. This primary key would then be stored in the table I am creating as foreign keys.
This would work fine for my application since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with such a large amount of foreign key fields and am fairly new to Teradata. Before I go on the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals, as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID   Type        Value
---  ----------  ----------
1    Interval    Daily
2    Interval    Monthly
3    Interval    Weekly
4    TimeFrame   24x7
5    TimeFrame   8x5
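In rough Teradata DDL terms, this is the shape I have in mind (all names here are placeholders, not the real schema):

CREATE TABLE primitive_lookup (
    id         INTEGER GENERATED ALWAYS AS IDENTITY
               (START WITH 1 INCREMENT BY 1) NOT NULL,
    prim_type  VARCHAR(30) NOT NULL,
    prim_value VARCHAR(50) NOT NULL
)
UNIQUE PRIMARY INDEX (id);

-- the wide table then stores only the surrogate keys
CREATE TABLE report_config (
    config_id    INTEGER NOT NULL,
    interval_id  INTEGER,   -- surrogate key from primitive_lookup ('Daily', 'Monthly', ...)
    timeframe_id INTEGER    -- surrogate key from primitive_lookup ('24x7', '8x5')
    -- ...roughly 30 such lookup columns in total...
)
UNIQUE PRIMARY INDEX (config_id);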

What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID   Type        Value
---  ----------  ----------
0    Unknown     Unknown
Then in the main table, instead of having fields as null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
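As a hedged sketch of what that buys you (reusing the placeholder names from the question above, not your actual schema), the reporting view stays a set of plain inner joins:

REPLACE VIEW v_report_config AS
SELECT rc.config_id,
       i.prim_value AS interval_name,    -- 'Daily', 'Monthly', 'Weekly', or 'Unknown'
       t.prim_value AS timeframe_name    -- '24x7', '8x5', or 'Unknown'
FROM   report_config    rc
JOIN   primitive_lookup i ON i.id = rc.interval_id     -- defaults to 0, so the inner join never drops rows
JOIN   primitive_lookup t ON t.id = rc.timeframe_id;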

Related

CDC strategy for multiple staging tables

I'm implementing a Data Mart following the Kimball methodology and I have a challenge with applying deltas from multiple source tables against a single target dimension.
Here's an example of the incoming source data:
STG_APPLICATION
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, ...
1, FOOBAR, 20/10/2018, MD5_XXX
STG_APPLICATION_STATUS
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, ...
1, SUBMITTED, "APP WAS SUBMITTED", MD5_YYY
Each of these tables (there are several others) represent a normalised version of the source data i.e. a single application can have one or more statuses associated with it.
Now then, because we only get a full alpha for these tables we have to do a snapshot merge, i.e. apply a full outer join on the current day set of records against the previous day set of records for each individual table. This is computed by comparing the CDC_HASH (a concat of all source columns). The result of this comparison is stored in a delta table as follows:
STG_APPLICATION_DELTA
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, CDC_STATUS ...
STG_APPLICATION_STATUS_DELTA
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, CDC_STATUS...
1, AWARDED, "APP WAS AWARDED", MD5_YYY, NEW
So in this example, the first table, STG_APPLICATION did not generate a delta record as the attributes pertaining to that table did not change between daily loads. However, the associated table, STG_APPLICATION_STATUS, did calculate a delta, i.e. one or more fields have changed since the last load. This is highlighted by the CDC_STATUS which identifies it as a new record to insert.
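To make the merge step concrete, it is roughly the following per table (the _PREV naming, the simplified join key, and the CDC_STATUS values here are a sketch of the idea, not our exact job):

INSERT INTO STG_APPLICATION_STATUS_DELTA
SELECT COALESCE(cur.APP_ID, prv.APP_ID) AS APP_ID,
       cur.STATUS_CODE,
       cur.STATUS_DESC,
       cur.CDC_HASH,
       CASE
           WHEN prv.APP_ID IS NULL THEN 'NEW'
           WHEN cur.APP_ID IS NULL THEN 'DELETED'
           ELSE 'CHANGED'
       END AS CDC_STATUS
FROM STG_APPLICATION_STATUS cur                    -- today's snapshot
FULL OUTER JOIN STG_APPLICATION_STATUS_PREV prv    -- yesterday's snapshot
  ON prv.APP_ID = cur.APP_ID                       -- join key simplified for the example
WHERE prv.APP_ID IS NULL
   OR cur.APP_ID IS NULL
   OR cur.CDC_HASH <> prv.CDC_HASH;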
The problem now of course is how to correctly handle this situation when loading the target dimension? For example:
DIM_APPLICATION
ID, APPLICATION_ID, APP_NAME, APP_START_DATE, APP_STATUS_CODE, FROM_DATE, TO_DATE
1, 1, FOOBAR, 20/10/2018, SUBMITTED, 20/10/2018, 12/04/2019
2, 1, NULL, NULL, AWARDED, 13/04/2019, 99/99/9999
This shows the first record - based on these two staging tables being joined - and the second record which is meant to reflect an updated version of the record. However, as previously illustrated, my Delta tables are only partially populated, and therefore I am unable to correctly update the dimension as shown here.
Logically, I understand that I need to be able to include all fields that are used by the dimension as part of my delta calculation, so that I have a copy of a full record when updating the dimension, but I'm not sure of the best way to implement this in my staging area. As shown already, I currently only have independent staging tables, each of which calculate their delta separately.
Please can somebody advise on the best way to handle this? I've scrutinized Kimball's books on this but to no avail, and I've equally found no suitable answer on any online forums. This is a common problem, so I'm sure there exists a suitable architectural pattern to resolve it.
You will need to either compare on joined records or look up the current dimension values.
If the amount of (unchanged) data is not excessive, you could join the full snapshots of STG_APPLICATION and STG_APPLICATION_STATUS together on APP_ID until they resemble the dimension record column-wise and store those in a separate table with their CDC hash to use as previous day. You then take the deltas at this level and send the (complete) changed records as updates to the dimension.
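A sketch of that first option follows. The combined table, its _PREV copy, the MD5 call, and the CTAS syntax are assumptions to illustrate the idea (adjust to your platform), not a drop-in implementation:

-- build today's combined snapshot at dimension grain
CREATE TABLE STG_APPLICATION_COMBINED AS
SELECT a.APP_ID,
       a.APP_NAME,
       a.APP_START_DATE,
       s.STATUS_CODE,
       MD5(a.APP_NAME || '|' || CAST(a.APP_START_DATE AS VARCHAR(20)) || '|' || s.STATUS_CODE) AS CDC_HASH
FROM STG_APPLICATION        a
JOIN STG_APPLICATION_STATUS s ON s.APP_ID = a.APP_ID;

-- complete new/changed records, ready to send to DIM_APPLICATION
SELECT cur.*
FROM STG_APPLICATION_COMBINED cur
LEFT JOIN STG_APPLICATION_COMBINED_PREV prv ON prv.APP_ID = cur.APP_ID
WHERE prv.APP_ID IS NULL
   OR cur.CDC_HASH <> prv.CDC_HASH;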
If the amount of records in the daily update makes it impractical to join the full tables, you can take the deltas and full outer join them as you do now. Then you look up the current dimension record for this APP_ID and fill in all empty fields in the delta record. The completed record is then sent as an update to the dimension.
This solution requires less storage but seems more fragile, especially if multiple changes are possible within a day. If there are many changes, performance may also suffer. For a handful of changes in millions of records, it should be more efficient.

How to get the last "row" in a cassandra's long row

In Cassandra, a row can be very long and store time-related data. For example, one row could look like the following:
RowKey: "weather"
name=2013-01-02:temperature, value=90,
name=2013-01-02:humidity, value=23,
name=2013-01-02:rain, value=false,
name=2013-01-03:temperature, value=91,
name=2013-01-03:humidity, value=24,
name=2013-01-03:rain, value=false,
name=2013-01-04:temperature, value=90,
name=2013-01-04:humidity, value=23,
name=2013-01-04:rain, value=false
9 columns of 3 days' weather info.
Time is part of the primary key in this row, so the order of the row is time-based.
My question is: is there any way for me to do a query like "what is the last/first day's humidity value in this row?" I know I could use an ORDER BY statement in CQL, but since this row is already sorted by time, there should be some way to just get the first/last one directly instead of doing another sort. Or does Cassandra already optimize this under the hood when ORDER BY is used?
Another way I could think of is to store another column in this row called "last_time_stamp" that is always updated as new data is inserted. But that would require one more update every time I insert new weather data.
Thanks for any suggestion!:)
Without seeing more of your actual table, I suggest using a timestamp (or timeuuid if there is a possibility for collisions) as the second component in a compound primary key. Using this, you can get the last "row" by selecting ORDER BY t DESC LIMIT 1.
You could also change the clustering order in your schema to order it naturally for "last N" queries.
Please see examples and linked resource in this answer.
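For example, a minimal CQL sketch (assuming a hypothetical weather_data table; your real model will have its own names and types):

CREATE TABLE weather_data (
    station     text,
    t           timestamp,
    temperature int,
    humidity    int,
    rain        boolean,
    PRIMARY KEY (station, t)
) WITH CLUSTERING ORDER BY (t DESC);

-- latest reading for the 'weather' partition; no re-sort happens because
-- rows are already stored in clustering order
SELECT t, humidity FROM weather_data WHERE station = 'weather' LIMIT 1;

-- with the default ascending order you would instead reverse at query time:
-- SELECT t, humidity FROM weather_data WHERE station = 'weather' ORDER BY t DESC LIMIT 1;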

Fact table organization

I am participating in the creation of reporting software which utilizes the Kimball star schema methodology. The entire team (including me) hasn't worked with this technology before, so we are new to it.
There are a couple of dimension and fact tables in our system so far. For example:
- DIM_Customer (dimension table for customers)
- DIM_BusinessUnit (dimension table for business units)
- FT_Transaction (fact table, granularity per transaction)
- FT_Customer (fact table for customer, customer id and as on date are in composite PK)
This is current structure of FT_Customer:
- customer_id # (customer id, part of composite PK)
- as_on_date # (date of observation, part of composite PK)
- waic (KPI)
- wat (KPI)
- waddl (KPI)
- wadtp (KPI)
- aging_bucket_current (KPI)
- aging_bucket_1_to_10 (KPI)
- aging_bucket_11_to_25 (KPI)
- ... ...
Fields waic, wat, waddl and wadtp are related to delays in transaction payment. These fields are calculated by an aggregation query against the FT_Transaction table, grouped by customer_id and as_on_date.
Fields aging_bucket_current, aging_bucket_1_to_10 and aging_bucket_11_to_25 contain the number of transactions categorized by delay in payment. For example, aging_bucket_current contains the number of transactions that are paid on time, aging_bucket_1_to_10 contains the number of transactions that are paid with a 1 to 10 day delay, and so on.
This structure is used for report generation from a PHP web application as well as Cognos studio. We discussed restructuring the FT_Customer table in order to make it more usable for external systems like Cognos.
New proposed structure of FT_Customer:
- customer_id # (customer id, part of composite PK)
- as_on_date # (date of observation, part of composite PK)
- kpi_id # (id of KPI, foreign key that points to DIM_KPI dimension table, part of composite PK)
- kpi_value (KPI value)
- ... ...
For this proposal we will have additional dimension table DIM_KPI:
- kpi_id #
- title
This table will contain all KPIs (wat, waic, waddl, aging buckets ...).
The second structure of FT_Customer will obviously have more rows than the current one.
Which structure of FT_Customer is more universal?
Is it acceptable to keep both structures in separate tables? This will obviously put an additional burden on the ETL layer because some of the work will be done twice, but on the other hand it will make generating various reports easier.
Thanks in advance for suggestions.
The 1st structure seems to be more natural and common to me. However, the 2nd one is more flexible, because it supports adding new KPIs without changing the structure of the fact table.
If different ways of accessing data actually require different structures, there is nothing wrong about having two fact tables with the same data, as long as:
both tables are always loaded together (not necessarily in parallel, but within the same data load job/workflow),
measure calculations are consistent (reuse the logic if possible).
You should test the results for any data inconsistencies.
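If you do end up keeping only the KPI-per-row structure, reports can still get the wide layout with a pivot query, roughly like this (the KPI titles are taken from the question; the exact ids and titles in DIM_KPI are assumptions):

SELECT f.customer_id,
       f.as_on_date,
       MAX(CASE WHEN k.title = 'waic'  THEN f.kpi_value END) AS waic,
       MAX(CASE WHEN k.title = 'wat'   THEN f.kpi_value END) AS wat,
       MAX(CASE WHEN k.title = 'waddl' THEN f.kpi_value END) AS waddl,
       MAX(CASE WHEN k.title = 'wadtp' THEN f.kpi_value END) AS wadtp
FROM FT_Customer f
JOIN DIM_KPI     k ON k.kpi_id = f.kpi_id
GROUP BY f.customer_id, f.as_on_date;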
Before you proceed, go buy yourself Agile Data Warehouse Design and read it thoroughly. It's pretty cheap.
http://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203
Your fact tables are for processes or events that you want to analyze. You should name them noun_verb_noun (example customers_order_items). If you can't come up with a name like that, you probably don't have a fact table. What is your Customer Fact table for? Customer is usually a dimension table.
The purpose of your data warehouse is to facilitate analysis. Use longer column names (with _ as word separator). Make life easy on your analysts.

Best pratice to save amount range values in db

I have an account table, and in that table I need to save an amount range.
I have a drop-down with values like $25k-$30k, $30k-$35k, and it needs to increase by $5k up to $250k.
I have planned to keep all the values in one table (currency range) and map the id to the account, but my mate suggests that it is better to save the values directly in the account table.
Which is the best practice?
This question may get closed by someone; I only need to know which is the best practice.
First of all, it is a wrong design approach to manage the range in a varchar column.
I am not sure about your purpose for keeping the range in a varchar. If it is only for display and no manipulation is required, then it is better to change the account table directly.
But if you are doing further manipulation, then there are two approaches to achieve it:
1. It would be good to have two separate columns, "MinValue" and "MaxValue", for the limit range.
2. If you are not supposed to change the account table, then it would be good to keep a separate table, accountLimit, with two columns for the range. You can then associate its ID with the account table and pick the values from accountLimit.
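A rough sketch of the second approach (column names and types are assumptions):

CREATE TABLE accountLimit (
    account_limit_id INT PRIMARY KEY,
    min_value        DECIMAL(12,2) NOT NULL,   -- e.g. 25000
    max_value        DECIMAL(12,2) NOT NULL    -- e.g. 30000
);

-- the account table then stores only the id of the range instead of a
-- display string like '$25k-$30k', for example:
-- ALTER TABLE account ADD account_limit_id INT REFERENCES accountLimit (account_limit_id);

-- numeric bounds also make range checks straightforward:
SELECT account_limit_id
FROM   accountLimit
WHERE  27000 BETWEEN min_value AND max_value;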

Having more than 50 columns in a SQL table

I have designed my database in such a way that one of my tables contains 52 columns. All the attributes are tightly associated with the primary key attribute, so there is no scope for further normalization.
Please let me know: if the same kind of situation arises and you don't want to keep so many columns in a single table, what other option is there?
It is not odd in any way to have 50 columns. ERP systems often have 100+ columns in some tables.
One thing you could look into is ensuring most columns have valid default values (null, today, etc.). That will simplify inserts.
Also ensure your code always specifies the columns (i.e. no "select *"). Any kind of future optimization will include indexes with a subset of the columns.
One approach we used once is to split your table into two tables. Both of these tables get the primary key of the original table. In the first table, you put your most frequently used columns, and in the second table you put the lesser used columns. Generally the first one should be smaller. You can now speed up things in the first table with various indices. In our design, we even had the first table running on a memory engine (in RAM), since we only had read queries. If you need the combination of columns from table1 and table2, you join both tables on the primary key, as in the sketch below.
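A sketch of that split (all table and column names are made up for illustration):

-- frequently used columns
CREATE TABLE orders_hot (
    order_id INT PRIMARY KEY,
    status   VARCHAR(20),
    total    DECIMAL(12,2)
);

-- rarely used columns, sharing the same primary key
CREATE TABLE orders_cold (
    order_id INT PRIMARY KEY,
    notes    VARCHAR(4000),
    misc     VARCHAR(100)
);

-- when columns from both halves are needed, join on the shared key
SELECT h.order_id, h.status, c.notes
FROM orders_hot  h
JOIN orders_cold c ON c.order_id = h.order_id;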
A table with fifty-two columns is not necessarily wrong. As others have pointed out many databases have such beasts. However I would not consider ERP systems as exemplars of good data design: in my experience they tend to be rather the opposite.
Anyway, moving on!
You say this:
"All the attributes are tightly associated with the primary key
attribute"
Which means that your table is in third normal form (or perhaps BCNF). That being the case it's not true that no further normalisation is possible. Perhaps you can go to fifth normal form?
Fifth normal form is about removing join dependencies. All your columns are dependent on the primary key, but there may also be dependencies between columns: e.g., there are multiple values of COL42 associated with each value of COL23. A join dependency means that when we add a new value of COL23 we end up inserting several records, one for each value of COL42. The Wikipedia article on 5NF has a good worked example.
I admit not many people go as far as 5NF. And it might well be that even with fifty-two columns your table is already in 5NF. But it's worth checking, because if you can break out one or two subsidiary tables you'll have improved your data model and made your main table easier to work with.
Another option is the "item-result pair" (IRP) design over the "multi-column table" (MCT) design, especially if you'll be adding more columns from time to time.
MCT_TABLE
---------
KEY_col(s)
Col1
Col2
Col3
...
IRP_TABLE
---------
KEY_col(s)
ITEM
VALUE
select * from IRP_TABLE;
KEY_COL  ITEM  VALUE
-------  ----  -----
1        NAME  Joe
1        AGE   44
1        WGT   202
...
IRP is a bit harder to use, but much more flexible.
I've built very large systems using the IRP design and it can perform well even for massive data. In fact, it behaves somewhat like a column-organized DB: you only pull in the rows you need (less I/O) rather than an entire wide row when you only need a few columns (more I/O).
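For example, reusing the IRP_TABLE layout above (the specific items queried are just illustrations):

-- read only the items you need for a key (less I/O than a wide row)
SELECT KEY_COL, ITEM, VALUE
FROM   IRP_TABLE
WHERE  KEY_COL = 1
  AND  ITEM IN ('NAME', 'AGE');

-- pivot back to a wide layout on demand when a report needs it
SELECT KEY_COL,
       MAX(CASE WHEN ITEM = 'NAME' THEN VALUE END) AS NAME,
       MAX(CASE WHEN ITEM = 'AGE'  THEN VALUE END) AS AGE,
       MAX(CASE WHEN ITEM = 'WGT'  THEN VALUE END) AS WGT
FROM   IRP_TABLE
GROUP BY KEY_COL;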
