Slowly Changing Dimensions - exact SQL query implementation to retrieve correct data - etl

I'm a bit new to BI development/data warehousing, but am facing the old Slowly Changing Dimensions dilemma. I've read a lot about the types and the theory, but have found little about what I'd consider the most common SELECT queries against these implementations.
I'll keep my example simple. Say you have four sales regions: East, West, North, and South. You have a group of salespeople who make daily sales and (maybe once a year) get reassigned to a new region.
So you'll have raw data like the following:
name; sales; revenue; date
John Smith; 10; 5400; 2015-02-17
You have data like this every day.
You may also have a dimensional table like the following, initially:
name; region
John Smith; East
Nancy Ray; West
Claire Faust; North
So the sales director wants to know the monthly sales revenue for the East region for May 2015. You would execute a query:
SELECT d.region, month(f.date), sum(f.revenue)
from Fact_Table f inner join Dim_Table d on f.name = d.name
where d.region = 'East' and f.date between ....
group by d.region, month(f.date)
You get the idea. Let's ignore that I'm using natural keys instead of surrogate integer keys; I'd clearly use surrogate keys.
Now, obviously, salespeople may move regions mid-year, or mid-month. So you have to implement an SCD type in order to run this query. To me personally, Type 2 makes the most sense, so say you implement that. Say John Smith changed from the East region to the West region on May 15, 2015. You implement the following table:
name; region; start_date; end_date
John Smith; East; 2015-01-01; 2015-05-15
John Smith; West; 2015-05-15; 9999-12-31
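For concreteness, here is a minimal DDL sketch of such a Type 2 dimension table (names and types are illustrative only; a surrogate key is included since, as noted above, one would be used in practice):

-- Sketch only: a Type 2 dimension with one row per salesperson per region assignment
CREATE TABLE Dim_Salesperson (
    salesperson_key INT PRIMARY KEY,        -- surrogate key, one per row version
    name            VARCHAR(100) NOT NULL,  -- natural/business key
    region          VARCHAR(20)  NOT NULL,
    start_date      DATE         NOT NULL,  -- first day this row is valid
    end_date        DATE         NOT NULL   -- e.g. 9999-12-31 for the current row
);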
Now the sales director asks the same question. What is the total sales revenue for the East for May 2015? Or moreover, show me the totals by region by month for the whole year. How would you structure the query?
SELECT d.region, month(f.date), sum(f.revenue)
from Fact_Table f inner join Dim_Table d
on f.name = d.name
and f.date between d.start_date and d.end_date
group by d.region, month(f.date)
Would that give the correct results? I assume it would, so my question is really this: once you have 1 million records in the Fact table, would this inner join be grossly inefficient, or is there a faster way to achieve the same result?
Would it make more sense to write the SCD attribute (like region) directly into a 'denormalized' Fact table, and when the dimension changes, perhaps update a week or two's worth of Fact records' regions retroactively?
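To make that alternative concrete, a rough sketch of such a retroactive update (SQL Server-style syntax; it assumes, purely for illustration, that a region column has been added to the Fact table):

-- Illustrative only: re-stamp the current region onto recent fact rows
UPDATE f
SET    region = d.region
FROM   Fact_Table AS f
INNER JOIN Dim_Table AS d
       ON  d.name = f.name
       AND f.date BETWEEN d.start_date AND d.end_date
WHERE  f.date >= DATEADD(day, -14, CAST(GETDATE() AS date));  -- e.g. only the last two weeks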

Your concept is correct if your business requirement has a hierarchy of Region->Seller, as shown in your example.
The performance of your current query may be challenging, but it will improve with appropriate dimension keys and attributes.
Use a date dimension hierarchy that includes Date -> Month, and you'll be able to avoid the range query.
Use integer surrogate keys in both dimensions and your indexing performance will improve.
One million rows is tiny; you won't have performance problems on any competent DBMS :)
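To make those suggestions concrete, here is a hedged sketch of what the same query could look like with integer surrogate keys and a date dimension (all table and column names below are assumptions, not the poster's actual schema). In the usual Type 2 pattern, each fact row stores the surrogate key of the dimension row that was valid on the sale date, so the start_date/end_date range join disappears from the query entirely:

-- Sketch only: fact keyed by surrogate keys, month taken from the date dimension
SELECT sp.region,
       dd.month_num,
       SUM(f.revenue) AS total_revenue
FROM Fact_Sales f
INNER JOIN Dim_Salesperson sp ON sp.salesperson_key = f.salesperson_key
INNER JOIN Dim_Date dd        ON dd.date_key = f.date_key
WHERE dd.year_num = 2015
GROUP BY sp.region, dd.month_num;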

Related

Data Warehouse Fact Constellation schema

I have two fact tables: one depends on a date dimension (day, month, year), and the other depends on month and year only.
So my question is: do I need to create two dimensions, one that has (day, month, year) and another that only has (year, month)?
Thank you.
A touch late here; sorry about that. Yes, you should build two dimension tables. I'd also recommend a relationship between them (i.e. each month has multiple days). Finally, and some consider this controversial, you might want to take more of a snowflake approach here and have the day-level table contain no information about months (e.g. month name, month number, etc.) beyond a link to the month table. The downside is that you'll almost always have to join the month table to the day table when you use the day table. Some feel this join is cheap and worth it for the benefit of reduced data redundancy. Others feel that any unnecessary join in a star is to be avoided.
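As a rough illustration of that snowflaked layout (names are purely illustrative):

CREATE TABLE Dim_Month (
    month_key  INT PRIMARY KEY,   -- e.g. 201505
    year_num   INT,
    month_num  INT,
    month_name VARCHAR(20)
);

CREATE TABLE Dim_Day (
    day_key    INT PRIMARY KEY,   -- e.g. 20150517
    full_date  DATE,
    month_key  INT REFERENCES Dim_Month (month_key)  -- link instead of repeated month attributes
);

The daily-grain fact would join to Dim_Day (and through it to Dim_Month when month attributes are needed), while the monthly-grain fact would join straight to Dim_Month.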

SSAS Performance: Multiple measures+no Dim vs one measure+DimType

I am building a finance cube and trying to understand the best practice while designing my main fact table.
Which do you think would be the better solution?
Have one column in the fact (amount) and an additional field which indicates the type of financial transaction (costs, income, tax, refund, etc.):
TransType Amount Date
Costs 10 Aug-1
Income 15 Aug-1
Refunds 5 Aug-2
Costs 5 Aug-2
"Pivot" the table to create several columns according to the type of the transaction.
Costs Income Refund Date
10 15 NULL Aug-1
5 NULL 5 Aug-2
Of course, the cube will follow whichever option is selected: either several real measures, or several calculated measures that are each based on one main measure sliced by a member of a "Transaction Type" dimension.
(In general, all transaction types have about the same number of rows.)
Thank you in advance.
Oren.
For a finance-related cube, I believe it is much better to use account dimension functionality.
By using an account dimension, you can add/remove accounts without changing the structure of your model. Also, if you use an account dimension, the cube's time balance (aggregate function) functionality can help you a lot.
However, the SSAS account dimension has its own problems as well. For example, if you assign a time balance to a formula or a hierarchical parent, it is silently ignored, and that is not documented as far as I know. So be ready to fix the calculations in the calculation script.
You can also use custom rollup member functionality to load your financial formulas.
In our case, we have 6000+ accounts, and the formulas can change without our control.
So having custom rollup member functionality helps a lot.
You need to be careful with solve orders (ratios, etc.), but that is as usual for any complicated/financial cube.
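For what it's worth, the flexibility argument for the first option (one amount column plus a transaction-type/account dimension) also shows up at the SQL level: the pivoted layout of the second option can always be derived at query time. A sketch with made-up table and column names:

-- Illustrative only: deriving per-type columns from a single-measure fact
SELECT f.transaction_date,
       SUM(CASE WHEN t.trans_type = 'Costs'   THEN f.amount END) AS costs,
       SUM(CASE WHEN t.trans_type = 'Income'  THEN f.amount END) AS income,
       SUM(CASE WHEN t.trans_type = 'Refunds' THEN f.amount END) AS refunds
FROM Fact_Finance f
INNER JOIN Dim_TransType t ON t.trans_type_key = f.trans_type_key
GROUP BY f.transaction_date;

Adding a new transaction type then only means adding a dimension row (or account member), not altering the fact table or the cube structure.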

Why am I getting all the pizzas (relational algebra), and why are my joins messing up?

This is the database I am using for my queries
https://class.stanford.edu/c4x/DB/RA/asset/pizzadata.html
The syntax for writing out the relational algebra queries is based on http://www.cs.duke.edu/~junyang/ra/ .
My query is to "Find all pizzas eaten by at least one female over the age of 20."
This is what I have so far:
\project_{name,pizza}(
Person \join_{gender='female' and age>20} Eats
)
I think I have the right logic here.("\join_{cond} is the relational theta-join operator.") I also showed the name column for debugging purposes. I am joining two relations and only keeping the rows where gender is female and age is > 20.
The result of my query differs from the correct query's result, and I don't think this is a syntax issue. In the Eats relation, Fay only eats mushroom. I don't understand why she is paired with every pizza combination.
Theta joins are cartesian; they join every row of each table with every row of every other table. In your example you are joining every row of Person where gender='female' and age>20 with every row of Eats, regardless of name. You probably want:
Person \join_{gender='female' and age>20 and name=eater} \rename{eater, pizza} Eats
Note that Thetas typically increase the number of rows; you typically reduce the number of rows returned using Sigmas or selections. A more idiomatic way of performing your statement would be with a Select and natural join:
\select{gender='female' and age>20} Person \join Eats
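For readers who think in SQL, the same query against the (assumed) Person(name, age, gender) and Eats(name, pizza) relations would be roughly:

-- Rough SQL equivalent of the select + natural join above
SELECT DISTINCT e.pizza
FROM Person p
INNER JOIN Eats e ON e.name = p.name   -- the natural-join condition made explicit
WHERE p.gender = 'female'
  AND p.age > 20;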

Multidimensional analysis in Hive/Impala

I have a denormalized table, say Sales, that looks like:
SalesKey
SalesOfParts, SalesOfEquipments, CostOfSales as some numeric measures
Industry, Country, State, Sales area, Equipment id, Customer id, Year of sale, Month of sale, and some more similar dimensions (12 dimensions in total).
I need to support aggregation queries on Sales, like the total number of sales in a year or month, their total cost, etc.
These aggregates also need to be filtered, e.g. total sales in year 2013, month 04, belonging to the Manufacturing industry for customer XYZ.
I have these dimension tables and facts in hive/impala.
I do not think I can make a cube on all the dimensions. I read a paper to see how to do OLAP over multiple dimensions :
http://www.vldb.org/conf/2004/RS14P1.PDF
The paper basically suggests materializing cubes over small fragments and doing some kind of runtime computation when a query spans multiple cubes.
I am not sure how to implement this model in Hive/Impala. Any pointers/suggestions will be awesome.
EDIT: I have about 10 million rows in the Sales table, and the number of dimensions is nowhere near 100; it is around 12 (might go up to 15), but each has a good cardinality.
I would build cubes using 3rd-party software. For example, icCube is an in-memory OLAP server that can handle 10 million rows over 12 dimensions with no issue at all; the response time would be sub-second across all dimensions. Moving 10 million rows out of Hive does not seem to be an issue (you could use the JDBC driver for that purpose). icCube is specifically designed to handle high sparsity properly.
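If the aggregates are to be materialized inside Hive itself instead, one hedged starting point (column names below are loose guesses from the question) is GROUPING SETS, which computes several group-bys over the fact table in a single scan, roughly in the spirit of the fragment cubes in the paper:

-- Sketch: materialize a few commonly used aggregate fragments in one pass over Sales
CREATE TABLE sales_agg AS
SELECT industry,
       customer_id,
       year_of_sale,
       month_of_sale,
       SUM(cost_of_sales)  AS total_cost,
       SUM(sales_of_parts) AS total_parts_sales,
       COUNT(*)            AS sales_count
FROM Sales
GROUP BY industry, customer_id, year_of_sale, month_of_sale
GROUPING SETS ((year_of_sale, month_of_sale),
               (industry, year_of_sale, month_of_sale),
               (industry, customer_id, year_of_sale, month_of_sale));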

What dimensions to use for a fact table "Incident"

I want to build a fact table that holds information about incidents.
The dimensions I suggested:
Time_Dimension: ID_Time, Year, Month, Day
Location_Dimension (City, for example): ID_City, Name
But what I don't get is this: the data mart is supposed to hold information about incidents, and I've noticed in some DWH designs that the incident is also used as a dimension. So I ask myself, what would be the benefit of the other dimensions (i.e. the location dimension, the time dimension) if all the information in the fact table is already in the "incident" dimension?
The measures to calculate are the "Cost of Incident" (per month) and the "Number of Incidents" (per month).
Having an incident dimension doesn't mean you would move the location and time into that dimension. An incident might have other attributes like who owns it, what type it is, etc. Those things would go in the incident dimension. If you have other things that tie to a location then you are doing the right thing to tie your incident dimension to the location dimension. And every fact should be tied to a date/time dimension.
It sounds like you are just getting started with dimensional modeling. You probably want to check out The Kimball Group to get a better understanding of how this works. The data warehouse toolkit book has many good examples that would help you understand how to model your data.
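As a purely illustrative sketch of the star the answer describes (reusing the dimension names from the question; the incident dimension carries descriptive attributes such as type and owner, while cost lives on the fact):

CREATE TABLE Incident_Dimension (
    ID_Incident   INT PRIMARY KEY,
    Incident_Type VARCHAR(50),     -- descriptive attributes of the incident itself
    Owner         VARCHAR(100)
);

CREATE TABLE Fact_Incident (
    ID_Incident   INT,             -- joins to Incident_Dimension
    ID_Time       INT,             -- joins to Time_Dimension
    ID_City       INT,             -- joins to Location_Dimension
    Incident_Cost DECIMAL(12, 2)   -- measure
);

-- Cost of incidents and number of incidents per month
SELECT t.Year, t.Month,
       SUM(f.Incident_Cost) AS Cost_Of_Incident,
       COUNT(*)             AS Number_Of_Incidents
FROM Fact_Incident f
INNER JOIN Time_Dimension t ON t.ID_Time = f.ID_Time
GROUP BY t.Year, t.Month;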

Resources