Star Schema - External Identifier fact or dimension? - etl

Here's a question I'm struggling with in a star schema design.
The outline is that we track packages with embedded globally unique identifiers (tags). Each of those tags gives rise to a series of chronological events. I consider the events to be the facts and am including the continuously variable values as columns in the fact table. Dimensions are things like the package type.
What I'm not sure about is whether the tag identifier should be in a dimension or directly on the fact table. We've currently got over 5 million unique tags we are tracking.
Is such a large dimension advisable?

It is a degenerate dimension and you should keep this column in the fact table.
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/degenerate-dimension/
http://www.kimballgroup.com/2003/06/design-tip-46-another-look-at-degenerate-dimensions/
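For illustration, here is a minimal sketch of that shape (all table and column names are hypothetical); the tag identifier lives directly on the fact row, with no five-million-row dimension behind it:

-- Fact table: one row per tag event. The tag ID is a degenerate
-- dimension: a dimension-like attribute stored on the fact with no
-- dimension table to join to.
CREATE TABLE fact_package_event (
    event_date_key    INT NOT NULL,      -- FK to the date dimension
    package_type_key  INT NOT NULL,      -- FK to the package-type dimension
    tag_id            CHAR(24) NOT NULL, -- globally unique tag, kept on the fact
    temperature       DECIMAL(6,2),      -- the continuously variable values
    humidity          DECIMAL(5,2)
);

-- Queries filter and group on the degenerate dimension directly:
SELECT tag_id, COUNT(*) AS event_count
FROM fact_package_event
GROUP BY tag_id;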

Related

How to create a DAX cross-sectional measure?

I don't know if I even worded the question correctly, but I'm trying to create a measure that depends on what is showing in the pivot table (using PowerPivot). In the image I posted, "DealMonth" is an expression in the PowerQuery table itself that simply takes the employee's start date and subtracts it from the month a deal was closed in. That shows how long it took that salesperson to close the deal. "TenureMonths" is also an expression in the PowerQuery table that calculates the person's tenure. The values populating this screenshot come from a total headcount measure I created. What I'm trying to do is create a separate measure that will show when "TenureMonths" is less than "DealMonth." So if TenureMonths is 5, then after a DealMonth of 5 the value would be 0. Is this possible?
[Screenshot]
I should add the following information.
"DealMonth" - Comes from the FactData table
"TenureMonths" - Comes from the DimSalesStart table
These two tables are joined by name. I feel like I'm so close because I can see what I want. The second image below is a copy/paste of the pivot table result but with my edits to show what I'd want to have shown. Basically, if(TenureMonths >= DealMonth,1,0). The trouble seems to be that since they're in two different tables, I can't make it work. The rows in the fact table are transactions, but the rows in the dim table are just the people with their start and end dates.
[Desired Result]
This is possible with something like IF([measure1] < [measure2], BLANK(), [measure1]); however, without seeing more of the data it's hard to guide you specifically.
You need to create two separate measures, one for TenureMonths and one for DealMonth. Depending on the data, each can be defined with an aggregation formula such as SUM, MIN, or MAX (it depends on whether there will be more than one value to aggregate).
Then reference those two measures in the formula pattern mentioned above, and that should give you what you want.
I figured out a solution. I added a dimension table for DealMonth itself and joined it to my fact table. That allowed me to write the formulas I needed.

Power BI - Use an unrelated table to find valid combinations of data

I think I can ask this question without any example data:
Let's say I have a large table, maybe 100 columns and a million rows. On each row, there are only a few valid combinations of data, but that only applies to five of the columns. The other 95 columns can really be anything.
I have another validation table of just those five relevant columns, where each row is a valid combination of values. Can I use this to create a "Valid/Invalid" flag on the main table?
I was thinking we wouldn't be making a relationship between these tables, although every validation column will also exist in the main table, so I suppose you could? In my real-world case there are no good primary keys, but if the solution is a many-to-many relationship I'll edit the question and title here.
Edit: the above was answered but here's a more difficult variation:
What if rows in the main table don't all include all five validation values? The validation table shows valid combinations, but nulls are allowed in the main table?
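To illustrate the shape of the problem in SQL terms (the actual solution would be in DAX or Power Query, and all names here are hypothetical), the flag amounts to a semi-join on the five columns, and the nullable variation needs null-tolerant matching:

-- Valid/Invalid flag: does an exact five-column match exist?
SELECT m.*,
       CASE WHEN v.col1 IS NOT NULL THEN 'Valid' ELSE 'Invalid' END AS flag
FROM main_table m
LEFT JOIN validation v
  ON  v.col1 = m.col1 AND v.col2 = m.col2 AND v.col3 = m.col3
  AND v.col4 = m.col4 AND v.col5 = m.col5;

-- Nullable variation: treat a NULL in the main table as "matches anything".
SELECT m.*
FROM main_table m
WHERE EXISTS (
    SELECT 1 FROM validation v
    WHERE (m.col1 IS NULL OR v.col1 = m.col1)
      AND (m.col2 IS NULL OR v.col2 = m.col2)
      AND (m.col3 IS NULL OR v.col3 = m.col3)
      AND (m.col4 IS NULL OR v.col4 = m.col4)
      AND (m.col5 IS NULL OR v.col5 = m.col5)
);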

CDC strategy for multiple staging tables

I'm implementing a Data Mart following the Kimball methodology and I have a challenge with applying deltas from multiple source tables against a single target dimension.
Here's an example of the incoming source data:
STG_APPLICATION
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, ...
1, FOOBAR, 20/10/2018, MD5_XXX
STG_APPLICATION_STATUS
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, ...
1, SUBMITTED, "APP WAS SUBMITTED", MD5_YYY
Each of these tables (there are several others) represents a normalised version of the source data, i.e. a single application can have one or more statuses associated with it.
Now then, because we only get a full extract for these tables, we have to do a snapshot merge, i.e. apply a full outer join of the current day's set of records against the previous day's set for each individual table. Changes are detected by comparing the CDC_HASH (a concatenation of all source columns). The result of this comparison is stored in a delta table as follows:
STG_APPLICATION_DELTA
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, CDC_STATUS ...
STG_APPLICATION_STATUS_DELTA
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, CDC_STATUS...
1, AWARDED, "APP WAS AWARDED", MD5_ZZZ, NEW
So in this example, the first table, STG_APPLICATION did not generate a delta record as the attributes pertaining to that table did not change between daily loads. However, the associated table, STG_APPLICATION_STATUS, did calculate a delta, i.e. one or more fields have changed since the last load. This is highlighted by the CDC_STATUS which identifies it as a new record to insert.
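For reference, a sketch of that per-table snapshot merge in generic SQL (STG_APPLICATION_STATUS_PREV is a hypothetical name for the previous day's snapshot, and the join is on APP_ID alone for simplicity; a real status table would join on its full natural key):

-- Full outer join of today's snapshot against yesterday's; rows whose
-- hash differs (or which appear/disappear) become delta records.
INSERT INTO STG_APPLICATION_STATUS_DELTA
SELECT COALESCE(cur.APP_ID, prv.APP_ID) AS APP_ID,
       cur.STATUS_CODE,
       cur.STATUS_DESC,
       cur.CDC_HASH,
       CASE
           WHEN prv.APP_ID IS NULL THEN 'NEW'
           WHEN cur.APP_ID IS NULL THEN 'DELETED'
           ELSE 'CHANGED'
       END AS CDC_STATUS
FROM STG_APPLICATION_STATUS cur
FULL OUTER JOIN STG_APPLICATION_STATUS_PREV prv
        ON prv.APP_ID = cur.APP_ID
WHERE prv.APP_ID IS NULL
   OR cur.APP_ID IS NULL
   OR prv.CDC_HASH <> cur.CDC_HASH;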
The problem now of course is how to correctly handle this situation when loading the target dimension? For example:
DIM_APPLICATION
ID, APPLICATION_ID, APP_NAME, APP_START_DATE, APP_STATUS_CODE, FROM_DATE, TO_DATE
1, 1, FOOBAR, 20/10/2018, SUBMITTED, 20/10/2018, 12/04/2019
2, 1, NULL, NULL, AWARDED, 13/04/2019, 99/99/9999
This shows the first record - based on these two staging tables being joined - and the second record which is meant to reflect an updated version of the record. However, as previously illustrated, my Delta tables are only partially populated, and therefore I am unable to correctly update the dimension as shown here.
Logically, I understand that I need to be able to include all fields that are used by the dimension as part of my delta calculation, so that I have a copy of a full record when updating the dimension, but I'm not sure of the best way to implement this in my staging area. As shown already, I currently only have independent staging tables, each of which calculate their delta separately.
Please can somebody advise on the best way to handle this? I've scrutinized Kimball's books on this but to no avail, and I've equally found no suitable answer on any online forums. This must be a common problem, so I'm sure a suitable architectural pattern exists to resolve it.
You will need to either compare joined records or look up the current dimension values.
If the amount of (unchanged) data is not excessive, you could join the full snapshots of STG_APPLICATION and STG_APPLICATION_STATUS together on APP_ID until they resemble the dimension record column-wise and store those in a separate table with their CDC hash to use as previous day. You then take the deltas at this level and send the (complete) changed records as updates to the dimension.
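A minimal sketch of this first approach, assuming a persisted pre-joined snapshot (STG_APPLICATION_JOINED_PREV is a hypothetical name, and the hash function and CREATE TABLE AS syntax vary by platform):

-- Join the full snapshots to the dimension's grain and compute one
-- combined hash over all dimension-relevant columns.
CREATE TABLE STG_APPLICATION_JOINED AS
SELECT a.APP_ID,
       a.APP_NAME,
       a.APP_START_DATE,
       s.STATUS_CODE,
       MD5(CONCAT(a.APP_NAME, '|', a.APP_START_DATE, '|', s.STATUS_CODE)) AS CDC_HASH
FROM STG_APPLICATION a
JOIN STG_APPLICATION_STATUS s ON s.APP_ID = a.APP_ID;

-- Delta at the joined level: a change in either source table changes the
-- combined hash, so every delta row is already a complete dimension record.
SELECT cur.*
FROM STG_APPLICATION_JOINED cur
LEFT JOIN STG_APPLICATION_JOINED_PREV prv ON prv.APP_ID = cur.APP_ID
WHERE prv.APP_ID IS NULL
   OR prv.CDC_HASH <> cur.CDC_HASH;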
If the amount of records in the daily update makes it impractical to join the full tables, you can take the deltas and full outer join them as you do now. Then you look up the current dimension record for this APP_ID and fill in all empty fields in the delta record. The completed record is then sent as an update to the dimension.
This solution requires less storage but seems more fragile, especially if multiple changes are possible within a day. If there are many changes, performance may also suffer. For a handful of changes in millions of records, it should be more efficient.
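And a sketch of the second approach, completing a partial delta by looking up the current dimension row (names follow the examples above; the completed record then drives the usual expire-and-insert):

-- Fill the missing attributes of a status-only delta from the current
-- (unexpired) dimension row before applying the SCD2 update.
SELECT d.APP_ID,
       dim.APP_NAME,                     -- unchanged, taken from the dimension
       dim.APP_START_DATE,               -- unchanged, taken from the dimension
       d.STATUS_CODE AS APP_STATUS_CODE  -- changed, taken from the delta
FROM STG_APPLICATION_STATUS_DELTA d
JOIN DIM_APPLICATION dim
  ON dim.APPLICATION_ID = d.APP_ID
 AND dim.TO_DATE = '99/99/9999';         -- "current version" marker, as above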

SNMP table with dynamic number of columns

I want to have an SNMP table with a dynamic number of rows and columns.
The code which creates the OIDs in snmpd is ready, but now I'm having problems with the MIB file.
A MIB file allows a dynamic number of rows (entries), but a table must have a constant number of columns.
I'm looking for a way to solve this problem. The following solutions may be possible, but I don't know if they are expressible in a MIB file:
The number of columns is between 1 and 32. If I could define the columns as optional, that would solve my problem.
Having a dynamic number of tables: if I could define a template table with a template name and OID, that would allow me to split my table into smaller dynamic tables, each with a static number of columns.
Currently I can't find any record of such solutions.
SNMP does not allow a dynamic number of columns in a table. It requires that the MIB describes the table completely, so that a manager knows which columns are present, before trying to contact the agent.
Defining tables dynamically is also not permitted.
If you edit your question to describe the data you are trying to model, perhaps we could figure out whether or not it's possible to model it in a MIB. I can certainly imagine situations where the capabilities of SNMP are insufficient to model a data set. It works best where data is either scalar, a tree, or a table with a fixed set of columns.
Edit: As k1eran posted in a comment, it is possible to simply not populate some columns with data, leaving a "sparse table". Please see his comment for a link.

Having more than 50 columns in a SQL table

I have designed my database in such a way that one of my tables contains 52 columns. All the attributes are tightly associated with the primary key attribute, so there is no scope for further normalization.
If the same kind of situation arises for you and you don't want to keep so many columns in a single table, what are the other options?
It is not odd in any way to have 50 columns. ERP systems often have 100+ columns in some tables.
One thing you could look into is ensuring most columns have valid default values (NULL, today's date, etc.). That will simplify inserts.
Also ensure your code always specifies the columns (i.e. no "select *"). Any kind of future optimization will include indexes with a subset of the columns.
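A small sketch of both points (hypothetical table and column names):

-- Defaults keep inserts short even on a wide table.
CREATE TABLE orders (
    order_id     INT PRIMARY KEY,
    customer_id  INT NOT NULL,
    status       VARCHAR(20) DEFAULT 'NEW',
    created_at   DATE DEFAULT CURRENT_DATE,
    notes        VARCHAR(500) DEFAULT NULL
    -- ... the remaining columns, each with a sensible default
);

-- An explicit column list instead of "select *" lets an index on a
-- subset of columns satisfy the query without reading the whole row.
SELECT order_id, status
FROM orders
WHERE customer_id = 42;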
One approach we used once is to split the table into two tables. Both tables get the primary key of the original table. In the first table you put the most frequently used columns, and in the second table you put the lesser-used columns. Generally the first one should be smaller. You can then speed things up in the first table with various indices. In our design, we even had the first table running on a memory engine (in RAM), since we only had read queries. If you need columns from both table1 and table2, you join the two tables on the primary key.
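A sketch of that split with hypothetical names; the first ("hot") table carries the frequently used columns, the second the rest, and both share the primary key:

-- Frequently used columns: small rows, cheap to index or keep in RAM.
CREATE TABLE main_hot (
    id      INT PRIMARY KEY,
    name    VARCHAR(100),
    status  VARCHAR(20)
);

-- Lesser-used columns share the same primary key.
CREATE TABLE main_cold (
    id                INT PRIMARY KEY REFERENCES main_hot (id),
    long_description  VARCHAR(4000),
    audit_notes       VARCHAR(4000)
    -- ... the other rarely read columns
);

-- When both halves are needed, join on the shared key.
SELECT h.name, c.long_description
FROM main_hot h
JOIN main_cold c ON c.id = h.id
WHERE h.id = 42;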
A table with fifty-two columns is not necessarily wrong. As others have pointed out, many databases have such beasts. However, I would not consider ERP systems exemplars of good data design: in my experience they tend to be rather the opposite.
Anyway, moving on!
You say this: "All the attributes are tightly associated with the primary key attribute."
This means your table is in third normal form (or perhaps BCNF). That being the case, it's still not true that no further normalisation is possible. Perhaps you can go to fifth normal form?
Fifth normal form is about removing join dependencies. All your columns are dependent on the primary key, but there may also be dependencies between columns: e.g., there are multiple values of COL42 associated with each value of COL23. Such a dependency means that when we add a new value of COL23 we end up inserting several records, one for each value of COL42. The Wikipedia article on 5NF has a good worked example.
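A hypothetical sketch of such a decomposition, using the COL23/COL42 names from the example: the repeating association moves into a subsidiary table.

-- Before: every new COL23 value means inserting one row per COL42 value,
-- duplicating all the other columns each time.
-- After: the COL23/COL42 association lives in its own table.
CREATE TABLE main_table (
    pk     INT PRIMARY KEY,
    col23  INT
    -- ... the other fifty columns
);

CREATE TABLE col23_col42 (
    col23  INT,
    col42  INT,
    PRIMARY KEY (col23, col42)  -- multiple COL42 values per COL23
);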
I admit not many people go as far as 5NF, and it might well be that even with fifty-two columns your table is already in 5NF. But it's worth checking, because if you can break out one or two subsidiary tables you'll have improved your data model and made your main table easier to work with.
Another option is the "item-result pair" (IRP) design over the "multi-column table" (MCT) design, especially if you'll be adding more columns from time to time.
MCT_TABLE
---------
KEY_col(s)
Col1
Col2
Col3
...
IRP_TABLE
---------
KEY_col(s)
ITEM
VALUE
select * from IRP_TABLE;
KEY_COL  ITEM  VALUE
-------  ----  -----
1        NAME  Joe
1        AGE   44
1        WGT   202
...
IRP is a bit harder to use, but much more flexible.
I've built very large systems using the IRP design and it can perform well even for massive data. In fact it behaves somewhat like a column-organized database: you pull in only the rows you need (less I/O) rather than an entire wide row when you only need a few columns (more I/O).
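For example, the "harder to use" part is mostly the pivot back to a wide shape when a query needs several items at once; a sketch against the hypothetical IRP_TABLE above:

-- Reassemble a wide row from the item/value pairs.
SELECT key_col,
       MAX(CASE WHEN item = 'NAME' THEN value END) AS name,
       MAX(CASE WHEN item = 'AGE'  THEN value END) AS age,
       MAX(CASE WHEN item = 'WGT'  THEN value END) AS wgt
FROM irp_table
GROUP BY key_col;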
