How to put data in the fact table? - ETL

I'm new to business intelligence,
and I have designed a star schema that implements a data mart to help analysts make decisions about student grades.
Dimension tables:
- module (module code, module name), which contains information about the module
- student (code, first_name, last_name, ...), which contains information about the student
- school subject (code, name, professor name, ...)
- degree (code, libelle), where "libelle" means label
- specialite (code, libelle), i.e. the specialty
- time (year, half year)
- geographie (continent, country, city), i.e. geography
Fact table:
- result (score, module score, year score)
The data source is Excel files:
each file contains a set of sheets, and each sheet presents students' scores for Level (Niveau) 'X', Specialty (Specialite) 'Y', Year and Half-Year 'Z', Module 'U', City 'A', ...
My question is:
how can I load the data from Excel into my dimensions and fact table?
For the dimensions I suppose it is easy, but I would like your suggestions.
For the fact table I have no idea.

Most basic answer: pick an ETL tool and start moving the data.
You will generally need to:
- Load your dimension tables first. The ID columns in these tables will link to the fact table.
- In your ETL package/routine that populates the fact table, select the data to be placed in the fact table from the source/staging.
- Do a lookup on each of the dimension tables against this data to get the ID of each dimension value.
- Do some duplicate detection to see if any of the rows are already in the fact table.
- Finally, insert the data.
This process will be broadly similar regardless of the ETL tool you use. There are a few tutorials that go into some detail (use google) but the basic technique is lookups to get the dimension keys.
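As a rough illustration, once the Excel sheets have been landed in a staging table, the fact load can look something like the sketch below. All table and column names here are made up for the example, and the dimensions are assumed to carry surrogate keys named *_key; adapt it to your actual schema.

-- Sketch only: stg_result is an assumed staging table filled from the Excel sheets.
INSERT INTO result (student_key, module_key, time_key, geographie_key, score)
SELECT st.student_key,
       m.module_key,
       t.time_key,
       g.geographie_key,
       stg.score
FROM stg_result stg
JOIN student    st ON st.code       = stg.student_code     -- lookup of each dimension key
JOIN module     m  ON m.module_code = stg.module_code
JOIN time       t  ON t.year        = stg.year
                  AND t.half_year   = stg.half_year
JOIN geographie g  ON g.city        = stg.city
WHERE NOT EXISTS (                                          -- duplicate detection
      SELECT 1
      FROM result f
      WHERE f.student_key = st.student_key
        AND f.module_key  = m.module_key
        AND f.time_key    = t.time_key);

Each JOIN is the "lookup" step; rows that fail a lookup (a missing dimension member) simply drop out here, so in practice you may want outer joins plus some error handling for unmatched rows.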

CDC strategy for multiple staging tables

I'm implementing a Data Mart following the Kimball methodology and I have a challenge with applying deltas from multiple source tables against a single target dimension.
Here's an example of the incoming source data:
STG_APPLICATION
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, ...
1, FOOBAR, 20/10/2018, MD5_XXX
STG_APPLICATION_STATUS
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, ...
1, SUBMITTED, "APP WAS SUBMITTED", MD5_YYY
Each of these tables (there are several others) represents a normalised version of the source data, i.e. a single application can have one or more statuses associated with it.
Now then, because we only get a full alpha for these tables, we have to do a snapshot merge, i.e. apply a full outer join on the current day's set of records against the previous day's set of records for each individual table. This is computed by comparing the CDC_HASH (a concatenation of all source columns). The result of this comparison is stored in a delta table as follows:
STG_APPLICATION_DELTA
APP_ID, APP_NAME, APP_START_DATE, CDC_HASH, CDC_STATUS ...
STG_APPLICATION_STATUS_DELTA
APP_ID, STATUS_CODE, STATUS_DESC, CDC_HASH, CDC_STATUS...
1, AWARDED, "APP WAS AWARDED", MD5_YYY, NEW
So in this example, the first table, STG_APPLICATION did not generate a delta record as the attributes pertaining to that table did not change between daily loads. However, the associated table, STG_APPLICATION_STATUS, did calculate a delta, i.e. one or more fields have changed since the last load. This is highlighted by the CDC_STATUS which identifies it as a new record to insert.
The problem now of course is how to correctly handle this situation when loading the target dimension? For example:
DIM_APPLICATION
ID, APPLICATION_ID, APP_NAME, APP_START_DATE, APP_STATUS_CODE, FROM_DATE, TO_DATE
1, 1, FOOBAR, 20/10/2018, SUBMITTED, 20/10/2018, 12/04/2019
2, 1, NULL, NULL, AWARDED, 13/04/2019, 99/99/9999
This shows the first record - based on these two staging tables being joined - and the second record which is meant to reflect an updated version of the record. However, as previously illustrated, my Delta tables are only partially populated, and therefore I am unable to correctly update the dimension as shown here.
Logically, I understand that I need to be able to include all fields that are used by the dimension as part of my delta calculation, so that I have a copy of a full record when updating the dimension, but I'm not sure of the best way to implement this in my staging area. As shown already, I currently only have independent staging tables, each of which calculate their delta separately.
Can somebody please advise on the best way to handle this? I've scrutinized Kimball's books on this but to no avail, and I've equally found no suitable answer on any online forums. This is a common problem, so I'm sure there exists a suitable architectural pattern to resolve it.
You will need to either compare on joined records or lookup the current dimension values.
If the amount of (unchanged) data is not excessive, you could join the full snapshots of STG_APPLICATION and STG_APPLICATION_STATUS together on APP_ID until they resemble the dimension record column-wise and store those in a separate table with their CDC hash to use as previous day. You then take the deltas at this level and send the (complete) changed records as updates to the dimension.
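For example, a rough sketch of that combined snapshot (STG_APPLICATION_COMBINED is an assumed name, and HASHBYTES/CONCAT are shown SQL Server style; use your platform's equivalents):

INSERT INTO STG_APPLICATION_COMBINED
       (APP_ID, APP_NAME, APP_START_DATE, STATUS_CODE, STATUS_DESC, CDC_HASH)
SELECT a.APP_ID,
       a.APP_NAME,
       a.APP_START_DATE,
       s.STATUS_CODE,
       s.STATUS_DESC,
       -- hash over the full joined record, so a change in either source table produces a delta
       HASHBYTES('MD5', CONCAT(a.APP_NAME, '|', a.APP_START_DATE, '|', s.STATUS_CODE, '|', s.STATUS_DESC))
FROM STG_APPLICATION a
JOIN STG_APPLICATION_STATUS s ON s.APP_ID = a.APP_ID;

The snapshot comparison against the previous day's copy of this table then works exactly as it does today, only at the grain of the dimension, so a changed status alone still yields a complete delta record.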
If the amount of records in the daily update makes it impractical to join the full tables, you can take the deltas and full outer join them as you do now. Then you look up the current dimension record for this APP_ID and fill in all empty fields in the delta record. The completed record is then sent as an update to the dimension.
This solution requires less storage but seems more fragile, especially if multiple changes are possible within a day. If there are many changes, performance may also suffer. For a handful of changes in millions of records, it should be more efficient.
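As a sketch of the fill-in step for the second approach, assuming the full-outer-joined deltas are available as delta_joined (an illustrative name) and the current dimension row is the one whose TO_DATE is still open:

SELECT d.APP_ID,
       COALESCE(d.APP_NAME,       dim.APP_NAME)        AS APP_NAME,
       COALESCE(d.APP_START_DATE, dim.APP_START_DATE)  AS APP_START_DATE,
       COALESCE(d.STATUS_CODE,    dim.APP_STATUS_CODE) AS APP_STATUS_CODE
FROM delta_joined d
LEFT JOIN DIM_APPLICATION dim
       ON dim.APPLICATION_ID = d.APP_ID
      AND dim.TO_DATE = '9999-12-31'   -- or however the open/current row is flagged

The completed record from this query is what you would then send to the dimension as the update.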

Unreliable results of a query

I need to get sales figures from open orders, sorted by code. The items are separated in the stock table by lot number (for traceability reasons) but the lot numbers do not appear in the orders table. The only link between the 2 tables is the part number.
When my query
SELECT code, SUM(qty*price) AS Sales
FROM orders INNER JOIN stock ON orders.partno = stock.partno
GROUP BY code
started returning strange results (very high sales figures for a given code), I changed it to
SELECT DISTINCT orders.partno, stock.lot, stock.code
FROM orders INNER JOIN stock ON orders.partno = stock.partno
and noticed that if several lots of a given part are in stock, they are all returned:
Part1 LotA code
Part1 LotB code
Part1 LotC code
which means that if a customer orders 300 units of Part1, my query returns 900 and my sales figure is multiplied by 3.
How can I work around that?
It must be noted that I do not work from a database but from a group of tables, the structures of which can sometimes be whimsical.
You should really use table.column or alias.column reference when writing queries. As your question stands, we do not know which table the PRICE comes from... the parts table or the lots table. If you are dealing with inventory tracking such as FIFO or LIFO method accounting, you must have an association to the lot table for inventory being tracked/sold.
Now, why are you getting large numbers? That is because of a Cartesian result: for each record in one table joined to another, the query returns one row per match.
So, if you have an order with one line item and there is only one matching row in a products table, that is a simple 1:1 ratio. But your STOCK table can have multiple records for the exact same part number, so you are returning the same original order line item for EACH LOT ENTRY in the stock table. For your 1 item, you are getting 3 lots (a 1:3 result).
I know this is important from a cost-of-goods sold basis, hence your need to know which "lot" it is joined to so you only get that one specific record for proper pricing.
If however, you do have a generic product table of everything you sell, and that table has a generic common price no matter which "lot" was used for the sale, I would join to that table instead for your report. But you will still have the accounting issue of inventory, cost-of-goods, etc.
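If the price (and quantity) do come from the orders table and every lot of a part number carries the same code, one possible workaround is to collapse the stock table to one row per part before joining, so the fan-out disappears:

SELECT s.code, SUM(o.qty * o.price) AS Sales
FROM orders o
INNER JOIN (SELECT DISTINCT partno, code FROM stock) s   -- one row per part, no lot fan-out
        ON o.partno = s.partno
GROUP BY s.code

If the price is lot-specific, this shortcut does not apply and you need the proper lot-level association described above.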

BIRT Report Designer - split actual and budget data stored in one table into columns and add a variance

I have financial data in a SQL database in a single table with Package, Flow, Account and Balance columns, and unfortunately I have to live with this format.
I have however been struggling to get it into the layout I need in a BIRT report: actual and budget figures split into separate columns, with a variance column added.
I have tried creating a data cube with Package, Flow and Account as Dimensions and Balance as a Measure, but that groups actual PER and actual YTD next to each other and budget PER and YTD next to each other, etc., so it is not quite what I need.
The other idea I had was to create four new calculated columns, the first of which would only have a value if the line were actual and PER, the next only if it were actual and YTD, and so on, but I could not get the IF function working in the calculated column.
What are the options? Can someone point me in the direction of how to best create this layout from this data structure so I can take it from there?
Thanks in advance.
I am not sure what DB you are using in the back end, but this is how I did it with SQL Server.
The important bit happens in the Data Set. Here is the SQL for my Data Set:
SELECT
Account,
Package,
Flow,
Balance
FROM data
UNION
SELECT DISTINCT
Account,
'VARIANCE',
Flow,
(SELECT COALESCE(SUM(Balance), 0) FROM data
  WHERE Account = d.Account AND Flow = d.Flow AND Package = 'ACTUAL')
-
(SELECT COALESCE(SUM(Balance), 0) FROM data
  WHERE Account = d.Account AND Flow = d.Flow AND Package = 'BUD') AS Balance
FROM data d
This gives me the original rows plus an extra set of rows with Package = 'VARIANCE'.
Then I created a DataCube that contained:
Groups/Dimensions:
- Account
- Flow
- Package
Summary Fields/Measures:
- Balance
Then I created a CrossTab Report that was based on that DataCube
And this produces the desired result: actual, budget and variance shown as separate columns.
Hopefully this helps.
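If you would rather get the four columns described in the question directly in the Data Set instead of extra rows, conditional aggregation is another option. The sketch below assumes the Flow values are literally 'PER' and 'YTD', which is only inferred from the question:

SELECT Account,
       SUM(CASE WHEN Package = 'ACTUAL' AND Flow = 'PER' THEN Balance ELSE 0 END) AS Actual_PER,
       SUM(CASE WHEN Package = 'ACTUAL' AND Flow = 'YTD' THEN Balance ELSE 0 END) AS Actual_YTD,
       SUM(CASE WHEN Package = 'BUD'    AND Flow = 'PER' THEN Balance ELSE 0 END) AS Budget_PER,
       SUM(CASE WHEN Package = 'BUD'    AND Flow = 'YTD' THEN Balance ELSE 0 END) AS Budget_YTD
FROM data
GROUP BY Account

The variance can then be added as computed columns (Actual_PER - Budget_PER, Actual_YTD - Budget_YTD) either in the SQL or in the report layout.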

SSRS Tablix Cell Calculation based on RowGroup Value

I have looked through several of the posts on SSRS tablix expressions and I can't find the answer to my particular issue.
I have a dashboard I am creating that contains summary data for various managers. They are entering monthly summary data into a single table structured like this:
CREATE TABLE OperationMetrics
(
    Date date,
    Plant char(10),
    Sales float,
    ReturnedProduct float
)
The data could use some grouping, so I created a table for referencing which report group these metrics go into. It looks like this:
CREATE TABLE OperationsReport
(
    ReportType varchar(50),
    MetricType varchar(50)
)
In this table, 'Sales' and 'ReturnedProduct' are MetricType entries, while 'ExecSummary' and 'Quality' are ReportType entries. To do the join, I decided to UNPIVOT the OperationMetrics table...
SELECT Date, Plant, Metric, MetricType
FROM (SELECT Date, Plant, Sales, ReturnedProduct FROM OperationMetrics) p
UNPIVOT (Metric FOR MetricType IN (Sales, ReturnedProduct)) AS UnPvt
and join it to the OperationsReport table so I have grouped metrics.
SELECT Date, Plant, Metric, Rpt.ReportType, OpEx.MetricType
FROM OpMetrics_Unpivoted OpEx
INNER JOIN OperationsReport Rpt ON OpEx.MetricType = Rpt.MetricType
(I understand that elements of this are not ideal, but sometimes we are not in control of our destiny.)
This does not include the whole of the tables, but you get the gist. So, they have a form that they use to fill in the OperationMetrics table. I chose SSRS to display the output.
I created a tablix with the following configuration (I can't post images due to my rep...)
- Date is the only column group, grouped on 'MMM-yy'
- Parent Row Group is the ReportType
- Child Row Group is the MetricType
Now, my problem is that some of the metrics are calculations of other metrics. For instance, 'Returned Product (% of Sales)' is not entered by the manager because it is assumed we can simply calculate that. It would be ReturnedProduct divided by Sales.
I attempted to calculate this by using a lookup function, as below:
Switch(Fields!FriendlyName.Value="Sales",SUM(Fields!Metric.Value),
Fields!FriendlyName.Value="ReturnedProduct",SUM(Fields!Metric.Value),
Fields!FriendlyName.Value="ReturnedProductPercent",Lookup("ReturnedProduct",
Fields!FriendlyName.Value,Fields!Metric.Value,"MetricDataSet")/
Lookup("Sales",Fields!FriendlyName.Value,Fields!Metric.Value,
"MetricDataSet"))
This works great! For the first month... but since Lookup looks for the first match, it just posts the same value for the rest of the months after.
I attempted to use this but it got me back to where I was at the beginning since the dataset does not have the value.
Any help with this would be well received. I would like to keep the rowgroup hierarchy.
It sounds like the LookUp is working for you but you just need to include the date to find the right month. LookUp will return the first match which is why it's only working on the first month.
What you can try is concatenating the Metric Name and Date fields in the LookUp.
Lookup("Sales" & CSTR(Fields!DATE.Value), Fields!FriendlyName.Value & CSTR(Fields!DATE.Value), Fields!Metric.Value, "MetricDataSet")
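Put together with the original Switch, that would look roughly like this (same field and dataset names as above; untested sketch):

Switch(Fields!FriendlyName.Value = "Sales", SUM(Fields!Metric.Value),
       Fields!FriendlyName.Value = "ReturnedProduct", SUM(Fields!Metric.Value),
       Fields!FriendlyName.Value = "ReturnedProductPercent",
           Lookup("ReturnedProduct" & CSTR(Fields!DATE.Value),
                  Fields!FriendlyName.Value & CSTR(Fields!DATE.Value),
                  Fields!Metric.Value, "MetricDataSet")
           / Lookup("Sales" & CSTR(Fields!DATE.Value),
                    Fields!FriendlyName.Value & CSTR(Fields!DATE.Value),
                    Fields!Metric.Value, "MetricDataSet"))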
Let me know if I misunderstood the issue.

Fact table organization

I am participating in the creation of reporting software which utilizes the Kimball star schema methodology. The entire team (including me) is new to this technology.
There are a couple of dimension and fact tables in our system so far. For example:
- DIM_Customer (dimension table for customers)
- DIM_BusinessUnit (dimension table for business units)
- FT_Transaction (fact table, granularity per transaction)
- FT_Customer (fact table for customer, customer id and as on date are in composite PK)
This is the current structure of FT_Customer:
- customer_id # (customer id, part of composite PK)
- as_on_date # (date of observation, part of composite PK)
- waic (KPI)
- wat (KPI)
- waddl (KPI)
- wadtp (KPI)
- aging_bucket_current (KPI)
- aging_bucket_1_to_10 (KPI)
- aging_bucket_11_to_25 (KPI)
- ... ...
Fields waic, wat, waddl and wadtp are related to delays in transaction payment. These fields are calculated by an aggregation query against the FT_Transaction table, grouped by customer_id and as_on_date.
Fields aging_bucket_current, aging_bucket_1_to_10 and aging_bucket_11_to_25 contain the number of transactions categorized by delay in payment. For example, aging_bucket_current contains the number of transactions that are paid on time, aging_bucket_1_to_10 contains the number of transactions that are paid with a 1 to 10 day delay, and so on.
This structure is used for report generation from a PHP web application as well as Cognos studio. We discussed restructuring the FT_Customer table in order to make it more usable for external systems like Cognos.
New proposed structure of FT_Customer:
- customer_id # (customer id, part of composite PK)
- as_on_date # (date of observation, part of composite PK)
- kpi_id # (id of KPI, foreign key that points to DIM_KPI dimension table, part of composite PK)
- kpi_value (value of the KPI)
- ... ...
For this proposal we will have additional dimension table DIM_KPI:
- kpi_id #
- title
This table will contain all KPIs (wat, waic, waddl, aging buckets ...).
The second structure of FT_Customer will obviously have more rows than the current structure.
Which structure of FT_Customer is more universal?
Is it acceptable to keep both structures in separate tables? This will obviously put an additional burden on the ETL layer because some of the work will be done twice, but on the other hand it will make the generation of various reports easier.
Thanks in advance for suggestions.
The 1st structure seems to be more natural and common to me. However, the 2nd one is more flexible, because it supports adding new KPIs without changing the structure of the fact table.
If different ways of accessing data actually require different structures, there is nothing wrong about having two fact tables with the same data, as long as:
both tables are always loaded together (not necessarily in parallel, but within the same data load job/workflow),
measure calculations are consistent (reuse the logic if possible).
You should test the results for any data inconsistencies.
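One way to reuse the logic is to load only the tall (kpi_id) structure from FT_Transaction and derive the wide structure from it rather than recomputing the KPIs, for example with a view. An illustrative sketch, assuming DIM_KPI.title holds the KPI names:

CREATE VIEW FT_Customer_Wide AS
SELECT f.customer_id,
       f.as_on_date,
       MAX(CASE WHEN k.title = 'waic'  THEN f.kpi_value END) AS waic,
       MAX(CASE WHEN k.title = 'wat'   THEN f.kpi_value END) AS wat,
       MAX(CASE WHEN k.title = 'waddl' THEN f.kpi_value END) AS waddl,
       MAX(CASE WHEN k.title = 'wadtp' THEN f.kpi_value END) AS wadtp
FROM FT_Customer f
JOIN DIM_KPI k ON k.kpi_id = f.kpi_id
GROUP BY f.customer_id, f.as_on_date;

A physical copy loaded in the same job works just as well if the PHP reports need a real wide table; the point is that both shapes come from a single calculation.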
Before you proceed, go buy yourself Agile Data Warehouse Design and read it thoroughly. It's pretty cheap.
http://www.amazon.com/Agile-Data-Warehouse-Design-Collaborative/dp/0956817203
Your fact tables are for processes or events that you want to analyze. You should name them noun_verb_noun (example customers_order_items). If you can't come up with a name like that, you probably don't have a fact table. What is your Customer Fact table for? Customer is usually a dimension table.
The purpose of your data warehouse is to facilitate analysis. Use longer column names (with _ as word separator). Make life easy on your analysts.
