Dimensional Modeling: same degenerate dimension in multiple fact tables, or a conformed dimension?

I have a retail sales system and want to create a data warehouse using Dimensional Modeling as described by Ralph Kimball.
I have a simple order fact table that measures order quantity and order dollar amount. From what I read in the book and on the internet, the order number is a degenerate dimension that sits on the fact table.
The order fact table also has a status. Initially I thought the status was a flow, so an accumulating snapshot fact table came to mind. That was all fine until I realized the status is actually not a flow but a label, so an order can change status from 'a' to 'b' and then back to 'a' again. My case is even worse because the order now has 3 types of status whose changes need to be tracked, so I think an accumulating snapshot fact table won't work here.
My attempt is to create 4 fact tables: order, order status a, order status b, and order status c.
Every new order creates a row in the order table as well as one row for the initial status in each order status table. Every subsequent change to a status is recorded by creating a new row in the corresponding order status table.
Since the order status tables are related to the order table, I need these 3 order status tables to reference the order table. How do I do that? Is using the same order number (the degenerate dimension) sufficient? I also think a conformed dimension could solve this, but those dimension rows would grow as fast as the order table itself. Any thoughts on this?
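For reference, here is a minimal sketch of the structure I'm describing (all table and column names are placeholders, not an actual schema): fact_order carries the degenerate order_number, and each status fact table carries the same order_number so its rows tie back to the order.

CREATE TABLE fact_order (
    order_number    varchar(20)    NOT NULL,  -- degenerate dimension
    dim_date_key    int            NOT NULL,
    order_quantity  int            NOT NULL,
    order_amount    decimal(18,2)  NOT NULL
);

CREATE TABLE fact_order_status_a (
    order_number    varchar(20)    NOT NULL,  -- same degenerate dimension, referencing the order
    dim_date_key    int            NOT NULL,  -- date the status change took effect
    status_value    varchar(10)    NOT NULL
);
-- fact_order_status_b and fact_order_status_c would follow the same pattern.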


Dynamic calculated column on different report levels (SSAS DAX query)

I'm trying to create a calculated column based on a derived measure in an SSAS cube. The measure counts the number of cases per order, so an order with 3 cases will have the value 3.
Now I'm trying to create a bucket attribute that says 1caseOrder, 2caseOrder, 3caseOrder, 3+caseOrder. I tried the following:
IF([nrofcase] = 1, "nrofcase[1]",
   IF([nrofcase] = 2, "nrofcase[2]",
      IF([nrofcase] = 3, "nrofcase[3]", "nrofcase[>3]")))
But it doesn't work as expected: when the level of the report is changed from quarter to week, it was supposed to recalculate at the different level.
Please let me know if this can work.
Calculated columns are static. When the column is added and when the table is processed, the value is calculated and stored. The only way for the value to change is to reprocess the model. If the formula refers to a DAX measure, it will use the measure without any of the context from the report (e.g. no row filters, slicers, etc.).
Think of it this way:
A calculated column is a fact about a row that doesn't change. It is known just by looking at a single row. An example of this is Cost = [Quantity] * [Unit Price]. Cost never changes and is known by looking at the Quantity and Unit Price columns. It doesn't matter what filters or context are in the report; Cost doesn't change.
A measure is a fact about a table. You have to look at multiple rows to calculate its value. An example is Total Cost = SUM(Sales[Cost]). You want this value to change depending on the context of time, region, product, etc., so its value is not stored but calculated dynamically in the report.
It sounds like for your data, there are multiple rows that tell you the number of cases per order, so this is a measure. Use a measure instead of a calculated column.
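A minimal sketch of that approach in DAX, assuming a tabular model where each case is one row in a fact table called Sales (the table name and the exact bucket labels are assumptions):

// Counts the case rows visible in the current filter context (quarter, week, slicers, etc.)
Cases per Order := COUNTROWS ( Sales )

// Bucket label evaluated dynamically from the measure above
Case Bucket :=
SWITCH (
    TRUE (),
    [Cases per Order] = 1, "1caseOrder",
    [Cases per Order] = 2, "2caseOrder",
    [Cases per Order] = 3, "3caseOrder",
    "3+caseOrder"
)

Because both are measures, they are re-evaluated whenever the report level changes, which is exactly what a calculated column cannot do.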

Do I need a time dimension for my fact table to prevent duplication?

I am designing a Data Warehouse and need some help with my fact table.
My fact table captures the facts for aged debt; it records all transactions against bills.
The dimension keys I have are listed below:
dim_month_end_key
dim_customer_key
dim_billing_account_key
dim_property_key
dim_bill_key
dim_charge_key
dim_payment_plan_key
dim_income_type_key
dim_transaction_date_key
dim_bill_date_key
I am trying to work out what my level of granularity should be, as all the keys together could be duplicated, say if a customer makes a payment twice in one day.
I am thinking that to solve this I can add a time dimension, as the time should always be different.
However, the company does not need to report on time. Do I add it to prevent duplication regardless?
Thanks
Cheryl
No, you don't need a time dimension.
There may be an apparent duplication in your fact table, but it will actually reflect 2 deposits in one day, so two valid records. The fact that you might not be able to tell the two transactions apart is not (necessarily) a problem for the system.
The report will sum all the deposit amounts, or count the number of deposits, along any dimension, and the totals will still be fine.
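As a minimal sketch (the fact and dimension names below are assumptions based on the keys listed in the question), the reporting query aggregates across the apparently duplicated rows, so it never needs to tell them apart:

SELECT d.month_end_date,
       SUM(f.transaction_amount) AS total_deposits,
       COUNT(*)                  AS deposit_count
FROM   fact_aged_debt f
JOIN   dim_month_end  d
  ON   d.dim_month_end_key = f.dim_month_end_key
GROUP BY d.month_end_date;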

What is a skewed column in Oracle?

I found a bottleneck in my query, which selects data from only a single table but still takes time. I used a non-unique index on the two columns used in the WHERE clause.
select name, isComplete from Student where year='2015' and isComplete='F'
Now I found a concept on the internet called a skewed column. What is it?
If you have an idea, please help me.
How do I resolve the problem of a skewed column, and how does a skewed column affect the performance of the query?
Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
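As a hedged example of how that is requested (the schema name here is an assumption, reusing the order_lines example above), DBMS_STATS can be told to build a histogram on the skewed column when gathering statistics:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'SALES',                             -- assumed schema name
    tabname    => 'ORDER_LINES',
    method_opt => 'FOR COLUMNS customer_id SIZE 254'   -- ask for a histogram on the skewed column
  );
END;
/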
Whether a full table scan or an index scan is the better choice can depend on a skewed column.
A skewed column is simply one whose values are unevenly spread, for example a gender column containing 60 male rows and 40 female rows.

BI: Fact Table Design/Data warehouse modelling

I have some issues designing my data warehouse and ETL process because of the fact table. It contains over 100 million rows covering 2 years of accounting data. The dimensions are related to the fact table via foreign keys, and I also used surrogate keys, indexes, and views. How would you deal with such a fact table in order to ensure good performance, a reasonable ETL process, and a data warehouse that is adaptive and resilient to change? Would partitioning the table by half year be a good approach?
First, you should look again at your data warehouse design.
In a fact table, the combination of foreign keys must be unique per row. If it isn't, there is something wrong with the ETL process.
You can easily check this by comparing the count of all rows in the fact table with the number of rows returned by a query that groups by every foreign key (select count(*) from fact_table group by fk1, fk2, ..., fkN). The two have to be equal; a sketch of the check follows.
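A minimal sketch of that check (fk1..fk3 stand in for whatever foreign key columns the fact table actually has); if the two numbers differ, some key combination appears more than once:

-- total number of rows in the fact table
SELECT COUNT(*) AS total_rows
FROM   fact_table;

-- number of distinct foreign-key combinations
SELECT COUNT(*) AS distinct_combinations
FROM  (SELECT fk1, fk2, fk3
       FROM   fact_table
       GROUP BY fk1, fk2, fk3) grouped;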
Next, you said that you already use surrogate keys as foreign keys, so there is no need to repeat the advice to use integer keys.
Partition the fact table by month; I don't see the case for a half-year period.
100 million rows is not too big. Perhaps you should also think about a columnar database (Vertica, for example).
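As a hedged sketch of monthly partitioning in SQL Server syntax (the table, columns, and boundary values are placeholders; other engines such as Oracle offer equivalent range/interval partitioning):

-- boundary values are placeholders; in practice one would be generated per month of data
CREATE PARTITION FUNCTION pf_fact_monthly (date)
AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

CREATE PARTITION SCHEME ps_fact_monthly
AS PARTITION pf_fact_monthly ALL TO ([PRIMARY]);

CREATE TABLE fact_accounting
(
    posting_date  date           NOT NULL,
    account_key   int            NOT NULL,
    amount        decimal(18, 2) NOT NULL
)
ON ps_fact_monthly (posting_date);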
I created a columnstore index on the fact table, and the query cost (relative to the batch) is now 14% with the index versus 86% without it. I think that's pretty good.
Execution plan below.
http://uploadimage.ro/img.php?image=4508_execution_plan_sk6y.png
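For reference, a minimal sketch of the kind of statement this involves in SQL Server (the index and table names are assumptions):

CREATE NONCLUSTERED COLUMNSTORE INDEX ix_fact_accounting_cs
ON fact_accounting (posting_date, account_key, amount);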

Index needed for max(col)?

I'm currently doing some data loading for a kind of warehouse solution. I get a data export from production each night, which then must be loaded. There are no other updates on the warehouse tables. To load only the new items for a certain table, I currently do the following steps:
get the current max value y for a specific column (id for journal tables and time for event tables)
load the data via a query like where x > y
To avoid performance issues (I load around 1 million rows per day) I removed most indexes from the tables (they are only needed in production, not in the warehouse). But that way the retrieval of the max value takes some time... so my question is:
What is the best way to get the current max value for a column without an index on that column? I just read about using the stats, but I don't know how to handle columns of type 'timestamp with time zone'. Disabling the index before the load and recreating it afterwards takes much too long...
The minimum and maximum values that are computed as part of column-level statistics are estimates. The optimizer only needs them to be reasonably close, not completely accurate. I certainly wouldn't trust them as part of a load process.
Loading a million rows per day isn't terribly much. Do you have an extremely small load window? I'm a bit hard-pressed to believe that you can't afford the cost of maintaining the index you would need to do a min/max index scan.
If you want to avoid indexes, however, you probably want to store the last max value in a separate table that you maintain as part of the load process. After you load rows 1-1000 in table A, you'd update the row in this summary table for table A to indicate that the last row you've processed is row 1000. The next time in, you would read the value from the summary table and start at 1001.
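A minimal sketch of that summary-table approach in generic SQL (the watermark table, staging table, and column names are all assumptions):

-- one row per target table, holding the last key already processed
CREATE TABLE load_watermark (
    table_name      varchar(128) NOT NULL PRIMARY KEY,
    last_loaded_id  bigint       NOT NULL
);

-- load only rows above the stored watermark
INSERT INTO warehouse_journal
SELECT s.*
FROM   staging_journal s
WHERE  s.id > (SELECT last_loaded_id
               FROM   load_watermark
               WHERE  table_name = 'warehouse_journal');

-- advance the watermark once the load has succeeded
UPDATE load_watermark
SET    last_loaded_id = (SELECT MAX(id) FROM staging_journal)
WHERE  table_name = 'warehouse_journal';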
If there is no index on the column, the only way for the DBMS to find the maximum value in the column is a complete table scan, which takes a long time for large tables.
I suppose a DBMS could try to keep track of the minimum and maximum values in the column (storing the values in the system catalog) as it does inserts, updates and deletes - but deletes are why no DBMS I know of tries to keep statistics up to date with per-row operations. If you delete the maximum value, finding the new maximum requires a table scan if the column is not indexed (and if it is indexed, the index makes it trivial to find the maximum value, so the information does not have to be stored in the system catalog). This is why they're called 'statistics'; they're an approximation to the values that apply. But when you request 'SELECT MAX(somecol) FROM sometable', you aren't asking for statistical maximum; you're asking for the actual current maximum.
Have the process that creates the extract file also extract a single-row file with the min/max you want. I assume that piece is scripted on some cron or scheduler, so it shouldn't be too much to ask to add the min/max calculations to that script ;)
If not, just do a full scan. A million rows isn't much really, especially in a data warehouse environment.
This code was written for Oracle, but should be compatible with most SQL dialects:
This gets the key of the max(high_val) in the table according to the range.
select high_val, my_key
  from (select high_val, my_key
          from mytable
         where something = 'avalue'
         order by high_val desc)
 where rownum <= 1
What this says is: sort mytable by high_val descending for rows where something = 'avalue', then grab only the top row, which gives you the max(high_val) in the selected range along with the my_key of that row.
