I want to build a fact table that holds information about incidents.
The dimensions I suggested:
Time_Dimension: ID_Time, Year, Month, Day
Location_Dimension (City, for example): ID_City, Name
But what I don't get is this: the data mart is supposed to hold information about incidents, and I've noticed in some DWH designs that "incident" is also used as a dimension. So I ask myself: what would be the benefit of the other dimensions (i.e. the location dimension and the time dimension) if all the information in the fact table is already in the "incident" dimension?
The measures to calculate are "Cost of Incident" (per month) and "Number of Incidents" (per month).
Having an incident dimension doesn't mean you would move the location and time into that dimension. An incident might have other attributes, like who owns it, what type it is, etc. Those things would go in the incident dimension. If you have other things that tie to a location, then you are doing the right thing by keeping location as its own dimension rather than folding it into the incident dimension. And every fact should be tied to a date/time dimension.
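To make that concrete, here is a minimal star-schema sketch. The time and location dimensions follow the question; the incident attributes, the fact table name, and the query are illustrative assumptions rather than a prescribed design.

    -- Minimal star-schema sketch (illustrative, not a prescribed design).
    CREATE TABLE Time_Dimension (
        ID_Time INT PRIMARY KEY,
        Year    INT,
        Month   INT,
        Day     INT
    );

    CREATE TABLE Location_Dimension (
        ID_City INT PRIMARY KEY,
        Name    VARCHAR(100)
    );

    CREATE TABLE Incident_Dimension (
        ID_Incident   INT PRIMARY KEY,
        Incident_Type VARCHAR(50),    -- e.g. outage, security, hardware
        Owner         VARCHAR(100)    -- who owns / handles the incident
    );

    CREATE TABLE Incident_Fact (
        ID_Time       INT REFERENCES Time_Dimension (ID_Time),
        ID_City       INT REFERENCES Location_Dimension (ID_City),
        ID_Incident   INT REFERENCES Incident_Dimension (ID_Incident),
        Incident_Cost DECIMAL(12,2)   -- the cost measure; the count is just COUNT(*)
    );

    -- "Cost of Incident" and "Number of Incidents" per month:
    SELECT t.Year, t.Month,
           SUM(f.Incident_Cost) AS Cost_Of_Incident,
           COUNT(*)             AS Number_Of_Incidents
    FROM Incident_Fact f
    JOIN Time_Dimension t ON t.ID_Time = f.ID_Time
    GROUP BY t.Year, t.Month;

The point is that the fact row carries only keys and measures; descriptive attributes stay in their own dimensions, and each dimension can still slice the measures independently.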
It sounds like you are just getting started with dimensional modeling. You probably want to check out the Kimball Group to get a better understanding of how this works. The Data Warehouse Toolkit book has many good examples that would help you understand how to model your data.
I am designing a Data Warehouse and need some help with my fact table.
My fact table is capturing the facts for aged debt, this table captures all transactions against bills.
The dimension keys I have are listed below:
dim_month_end_key
dim_customer_key
dim_billing_account_key
dim_property_key
dim_bill_key
dim_charge_key
dim_payment_plan_key
dim_income_type_key
dim_transaction_date_key
dim_bill_date_key
I am trying to work out what my level of granularity should be, as all the keys together could be duplicated, say if a customer makes a payment twice in one day.
I am thinking that, to solve this, I can add a time dimension, as the time should always be different.
However, the company does not need to report on time. Do I add it regardless, just to prevent duplication?
Thanks
Cheryl
No, you don't need a time dimension.
There may be an apparent duplication in your fact, but it will actually reflect two deposits in one day, so two valid records. The fact that you might not be able to tell the two transactions apart is not (necessarily) a problem for the system.
The report will sum all the deposit amounts, or count the number of deposits, along any dimension, and the totals will still be correct.
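A quick illustration of that point, with assumed measure and attribute names (the question does not list them): two payments by the same customer on the same day simply become two fact rows, and any aggregate query still returns the right totals.

    SELECT d.calendar_month,
           c.customer_name,
           SUM(f.transaction_amount) AS total_paid,      -- both deposits are summed
           COUNT(*)                  AS number_of_payments
    FROM   fact_aged_debt       f
    JOIN   dim_transaction_date d ON d.dim_transaction_date_key = f.dim_transaction_date_key
    JOIN   dim_customer         c ON c.dim_customer_key         = f.dim_customer_key
    GROUP BY d.calendar_month, c.customer_name;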
I was creating some analysis of revenue for past years. One thing I noticed is that the revenue measures for each month of a year are the same across every year's corresponding month; that is, revenue for April 2015 is the same as revenue for April 2016.
I did some searching to solve this problem. I found that our measure column 'Revenue' is aggregated based on the time dimension as 'Last(sum(revenue))'. So the actual revenue value for April 2019 is treated by OBIEE as the last value and effectively copied to every other year's April revenue.
I can understand that the keyword 'last' may be the reason for this, but shouldn't the year, quarter and month columns pick exactly the numbers that correspond to that date? Can someone explain how this works and suggest solutions, please?
Very simply put: The "LAST" is the reason. It doesn't "copy" the value though. It aggregates the values to the last existing value along the dimensional hierarchy specified.
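As a rough plain-SQL illustration (this is not the SQL OBIEE generates, and the table and column names are assumptions): a LAST-style balance takes the value at the last date that exists within each period, whereas a normal additive measure sums every row in the period.

    SELECT calendar_year, calendar_month,
           SUM(revenue) AS summed_revenue,                       -- fully additive
           SUM(CASE WHEN day_date = last_day_in_month
                    THEN revenue ELSE 0 END) AS last_day_revenue -- LAST-style value
    FROM (
        SELECT f.revenue, d.calendar_year, d.calendar_month, d.day_date,
               MAX(d.day_date) OVER (PARTITION BY d.calendar_year, d.calendar_month)
                   AS last_day_in_month
        FROM   fact_revenue f
        JOIN   dim_time d ON d.time_key = f.time_key
    ) x
    GROUP BY calendar_year, calendar_month;

Which level that "last existing value" is resolved at depends entirely on the time hierarchy, which is why the dimension setup matters so much here.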
The question is: What SHOULD that Saldo show? What is the real business rule?
Also, lastly: using technical column names and ALL UPPER CASE column names in the BMM layer shouldn't be done. The names should be user-focused, readable and pretty. Otherwise everybody has to go and change them 50 times over and over in the front end.
It's been a year since I posted this question, but a fix for this incorrect representation of data was added today. In the previous version of the rpd, we used an alternative solution: creating two measure columns for saldo (saldo_year and saldo_month), setting their levels at year and month respectively, and using them both in an analysis. This was a temporary solution until we built the second version of our rpd, since we realized the structure of the old one wasn't completely correct and it was easier and less time-consuming to create a new one from the ground up than to fix the old one.
So, as @Chris mentioned, it was all about a correct time dimension and hierarchies. We thought we had created it with all requirements met, but recently we got the same problem in our analyses. Then we figured out that we hadn't set the id columns as primary keys in the month and quarter logical levels. After that, we got the data we wanted. If anybody faces this kind of problem, the first thing to check in the rpd is how the time dimension and hierarchy are defined, and how the logical levels, primary keys and chronological keys are set in the hierarchy.
I am building a finance cube and trying to understand the best practice while designing my main fact table.
What do you think would be the better solution?
Have one column in the fact table (Amount) and an additional field indicating the type of financial transaction (costs, income, tax, refund, etc.):
TransType   Amount   Date
Costs       10       Aug-1
Income      15       Aug-1
Refunds     5        Aug-2
Costs       5        Aug-2
"Pivot" the table to create several columns according to the type of the transaction.
Costs   Income   Refund   Date
10      15       NULL     Aug-1
5       NULL     5        Aug-2
Of course, the cube will follow whichever option is selected: either several real measures, or several calculated measures, each of which is based on the one main measure sliced by a member of a "Transaction Type" dimension.
(In general, all transaction types have about the same number of rows.)
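For concreteness (with illustrative table and column names), this is how the two layouts relate: the pivoted columns of the second option can always be derived from the single Amount column of the first with conditional aggregation.

    SELECT [Date],
           SUM(CASE WHEN TransType = 'Costs'   THEN Amount END) AS Costs,
           SUM(CASE WHEN TransType = 'Income'  THEN Amount END) AS Income,
           SUM(CASE WHEN TransType = 'Refunds' THEN Amount END) AS Refunds
    FROM   FactFinance
    GROUP BY [Date];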
Thank you in advance.
Oren.
For a finance-related cube, I believe it is much better to use account dimension functionality.
By using an account dimension, you can add or remove accounts without changing the structure of your model. Also, if you use an account dimension, the time balance (aggregation function) functionality of the cube can help you a lot.
However, the SSAS account dimension has its own problems as well. For example, if you assign a time balance to a formula or a hierarchical parent, it is silently ignored, and that is not documented as far as I know. So be ready to fix the calculations in the calculation script.
You can also use custom rollup member functionality to load your financial formulas.
In our case, we have 6,000+ accounts, and the formulas can change outside our control.
So having custom rollup member functionality helps a lot.
You need to be careful with solve orders (ratios, etc.), but that is usual for any complicated financial cube.
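For reference, a rough relational sketch of the kind of source table an account dimension with unary operators and custom rollup formulas is typically built from; the column names are assumptions for illustration, not a required SSAS schema.

    -- Illustrative source table for a parent-child account dimension
    -- with unary operators and custom member formulas.
    CREATE TABLE dim_account (
        account_key           INT PRIMARY KEY,
        parent_account_key    INT,             -- parent-child hierarchy
        account_name          VARCHAR(100),
        account_type          VARCHAR(30),     -- e.g. Asset, Liability, Income, Expense;
                                               -- drives the time-balance behaviour
        unary_operator        CHAR(1),         -- '+', '-', '~', ... for simple rollups
        custom_rollup_formula VARCHAR(4000)    -- MDX expression loaded from the source
    );

Because the formulas live in the data rather than in the cube script, accounts and their rollup rules can change without redesigning the model, which is what makes a 6,000+ account scenario manageable.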
(This is a theoretical question about a system design I am working on; suggestions for changes are welcome.)
I have a large table of GPS data which contains the following columns:
locID - PK
userID - User ID of the user of the app
lat
long
timestamp - point in UNIX time when this data was recorded
I am trying to design a way that will allow a server to go through this data set and check whether any users were in a specific place together (e.g. within 50 m of each other) within a specific time range (2 min), e.g. did user1 visit the same vicinity as user2 within that 2-minute time gap?
The only way I can currently think of is to check each row one by one against all the rows in the same time frame using a coordinate distance-check algorithm. But this runs into a problem: if the users are spread all around the world and there are thousands, maybe millions, of rows in that time frame, this would not work efficiently.
Also, what if I want to know how long they were in each other's vicinity?
Any ideas or thoughts would be helpful, including which database to use (I am thinking either PostgreSQL or maybe Cassandra) and the table layout. All help appreciated.
Divide the globe into patches, where each patch is small enough to contain only a few thousand people, say 200m by 200m, and add the patchID as an attribute to each entry in the database. Note that two users cannot be in close proximity if they aren't in the same patch or in adjacent patches. Therefore, when checking for two users in the same place at a given time, query the database for a given patchID and the eight surrounding patchIDs, to get a subset of the database that may contain possible collisions.
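A PostgreSQL sketch of that idea, using the columns from the question (userID, lat, long, and timestamp, assuming it is stored as an integer of UNIX epoch seconds); the table name, grid size and patch columns are assumptions for illustration.

    -- A ~0.002-degree grid is roughly 200 m of latitude per cell; longitude
    -- cells shrink towards the poles, which only makes the pre-filter stricter.
    ALTER TABLE gps_data
        ADD COLUMN patch_x integer,
        ADD COLUMN patch_y integer;

    UPDATE gps_data
    SET    patch_x = floor(long / 0.002)::int,
           patch_y = floor(lat  / 0.002)::int;

    CREATE INDEX ON gps_data (patch_x, patch_y, "timestamp");

    -- Candidate pairs: same or adjacent patch, within a 2-minute window.
    SELECT a.userID AS user_a,
           b.userID AS user_b,
           a."timestamp" AS time_a,
           b."timestamp" AS time_b
    FROM   gps_data a
    JOIN   gps_data b
      ON   b.userID  > a.userID                          -- each unordered pair once
     AND   b.patch_x BETWEEN a.patch_x - 1 AND a.patch_x + 1
     AND   b.patch_y BETWEEN a.patch_y - 1 AND a.patch_y + 1
     AND   b."timestamp" BETWEEN a."timestamp" - 120 AND a."timestamp" + 120;

    -- The exact 50 m distance check (e.g. PostGIS ST_DWithin or the
    -- earthdistance extension) then only runs on this small candidate set.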
I am in the process of merging two data sets together in Stata and came up with a potential concern.
I am planning on sorting each data set in exactly the same manner on several categorical variables that are common to both data sets. HOWEVER, several of the categorical variables have more categories present in one data set than in the other. I have been careful enough to ensure that the coding matches up in both data sets (e.g. Red is coded as 1 in both data set A and data set B, but data set A has only Red, Green and Blue, whereas data set B has Red, Green, Blue and Yellow).
If I were to sort each data set the same way and generate an id variable (gen id = _n) and merge on that, would I run into any problems?
There is no statistical question here, as this is purely about data management in Stata, so I too shall shortly vote for this to be migrated to Stack Overflow, where I would be one of those who might try to answer it; so I will do that now.
What you describe to generate identifiers is not how to think of merging data sets, regardless of any of the other details in your question.
Imagine any two data sets, and then in each data set, generate an identifier that is based on the observation numbers, as you propose. Generating such similar identifiers does not create a genuine merge key. You might as well say that four values "Alan" "Bill" "Christopher" "David" in one data set can be merged with "William" "Xavier" "Yulia" "Zach" in another data set because both can be labelled with observation numbers 1 to 4.
My advice is threefold:
Try what you are proposing with your data and try to understand the results.
Consider whether you have something else altogether, namely an append problem. It is quite common to confuse the two.
If both of those fail, come back with a real problem and real code and real results for a small sample, rather than abstract worries.
I think I may have solved my problem, so I figured I would post an answer specifically relating to it in case anybody else has the same issue.
~~
I have two data sets: one containing information about the amount of time IT help spent at a customer, and another with how much product a customer purchased. Both data sets contain unique ID numbers for each company, plus the fiscal quarter and year, which link the sets together (e.g. ID# 1001 corresponds to the same company in both data sets). Additionally, the IT data set contains unique ID numbers for each IT person, and the customer purchases data set contains a unique ID number for each purchase made. I am not interested in analysis at the individual employee level, so I collapsed the IT time data set to the total time spent at a given company, regardless of who was there.
I was interested in merging both data sets so that I could perform analysis to estimate some sort of "responsiveness" (or elasticity) function linking together IT time spent and products purchased.
I am certain this is a case of "merging" data because I want to add more VARIABLES, not OBSERVATIONS; that is, I wish to widen my final data set horizontally rather than lengthen it vertically.
Stata 12 has many options for merging: one-to-one, many-to-one, and one-to-many. Supposing that I treat my (collapsed) IT time data set as my master and my purchases data set as my merging (using) set, I would perform a "1:m" or one-to-many merge. This is because I have MANY purchases corresponding to ONE IT-time observation per quarter per company.