SSAS performance: multiple measures with no dimension vs. one measure with a transaction-type dimension

I am building a finance cube and am trying to understand best practice for designing my main fact table.
Which do you think would be the better solution?
Option 1: have one Amount column in the fact table plus an additional field that indicates the type of financial transaction (costs, income, tax, refund, etc.).
TransType  Amount  Date
Costs      10      Aug-1
Income     15      Aug-1
Refunds    5       Aug-2
Costs      5       Aug-2
"Pivot" the table to create several columns according to the type of the transaction.
Costs  Income  Refund  Date
10     15      NULL    Aug-1
5      NULL    5       Aug-2
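To make the two layouts concrete, here is a minimal pandas sketch (illustration only, using the column names and values from the samples above) showing that option 2 is just a pivot of option 1:

```python
import pandas as pd

# Option 1: a single Amount measure plus a transaction-type attribute.
fact = pd.DataFrame({
    "TransType": ["Costs", "Income", "Refunds", "Costs"],
    "Amount":    [10, 15, 5, 5],
    "Date":      ["Aug-1", "Aug-1", "Aug-2", "Aug-2"],
})

# Option 2: pivot the type into separate measure columns
# (NULL/NaN where a type has no rows for that date).
pivoted = fact.pivot_table(index="Date", columns="TransType",
                           values="Amount", aggfunc="sum")
print(pivoted)
```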
Of course, the cube will follow whichever option is selected: several physical measures vs. several calculated measures, each based on the single main measure sliced by a member of a "Transaction Type" dimension.
(In general, all transaction types have roughly the same number of rows.)
Thank you in advance.
Oren.

For a finance-related cube, I believe it is much better to use the account dimension functionality.
By using an account dimension, you can add or remove accounts without changing the structure of your model. Also, if you use an account dimension, the time-balance (aggregation function) functionality of the cube can help you a lot.
However, the SSAS account dimension has its own problems as well. For example, if you assign a time balance to a formula or a hierarchical parent, it is silently ignored, and that is not documented as far as I know. So be ready to fix the calculations in the calculation script.
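To make the time-balance idea concrete, here is a small hypothetical pandas sketch (not SSAS syntax; the account and column names are made up) contrasting a flow account, which sums over time, with a balance account, which should report the last value of the period, as a LastChild/LastNonEmpty time-balance setting would:

```python
import pandas as pd

# Hypothetical monthly fact rows for two accounts: "Costs" is a flow
# account, "CashBalance" is a balance account.
fact = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "account": ["Costs"] * 3 + ["CashBalance"] * 3,
    "amount":  [10, 15, 5, 100, 120, 90],
})

# Quarter total for the flow account: a plain SUM over the months.
costs_q1 = fact.loc[fact.account == "Costs", "amount"].sum()            # 30

# Quarter value for the balance account: the last month's value, which
# is what a LastChild / LastNonEmpty time balance returns in the cube.
cash_q1 = fact.loc[fact.account == "CashBalance", "amount"].iloc[-1]    # 90

print(costs_q1, cash_q1)
```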
You can also use the custom rollup member functionality to load your financial formulas.
In our case, we have 6000+ accounts, and the formulas can change outside our control, so having custom rollup members helps a lot.
You need to be careful with solve orders (ratios, etc.), but that is as usual for any complicated/financial cube.

Related

Do I need a time dimension for my fact table to prevent duplication?

I am designing a data warehouse and need some help with my fact table.
My fact table captures the facts for aged debt; it records all transactions against bills.
The dimension keys I have are listed below:
dim_month_end_key
dim_customer_key
dim_billing_account_key
dim_property_key
dim_bill_key
dim_charge_key
dim_payment_plan_key
dim_income_type_key
dim_transaction_date_key
dim_bill_date_key
I am trying to work out what my level of granularity should be, as all the keys together could be duplicated, say if a customer makes a payment twice in one day.
I am thinking that to solve this I could add a time dimension, since the time should always be different.
However, the company does not need to report on time. Do I add it anyway, just to prevent duplication?
Thanks
Cheryl
No, you don't need a time dimension.
There may be apparent duplication in your fact table, but it actually reflects two deposits in one day, so both records are valid. The fact that you might not be able to tell the two transactions apart is not (necessarily) a problem for the system.
A report will sum the deposit amounts, or count the number of deposits, along any dimension, and the totals will still be correct.
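A tiny pandas sketch (hypothetical amounts and a made-up payment_amount measure alongside two of the keys listed above) of why the apparent duplicates stay harmless:

```python
import pandas as pd

# Two payments by the same customer against the same bill on the same
# day: the two fact rows share every dimension key, yet the aggregates
# still come out right.
fact = pd.DataFrame({
    "dim_customer_key":         [42, 42],
    "dim_transaction_date_key": [20190801, 20190801],
    "payment_amount":           [50.0, 25.0],   # hypothetical measure
})

by_day = fact.groupby(["dim_customer_key", "dim_transaction_date_key"]).agg(
    total_paid=("payment_amount", "sum"),            # 75.0
    number_of_payments=("payment_amount", "count"),  # 2
)
print(by_day)
```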

How to create visualization using ratio of fields

I have a data set similar to the table below (simplified for brevity).
I need to calculate the total spend per conversion per team for every month, with the ability to plot this as a time-based line chart being an additional nicety. The total spend is equal to the sum of Phone Expenditure, Travel Allowance, and Misc. Allowance; this can be a calculated field.
I cannot simply add a calculated field for the ratio, because for some sales people the number of conversions can be 0 in a given month, so averaging over the team is not an option. How can I go about this?
Thanks for help and suggestions in advance!
I've discussed the question with Harish offline. I've learned that he is trying to calculate a ratio per group, not per row.
To perform calculations per group, you can add calculated fields inside a QuickSight analysis and use level-aware aggregation expressions. (Note that level-aware aggregations can only be used in an analysis, not in the data prep view.) Here is a link to the documentation about level-aware aggregations if you want to learn more: https://docs.aws.amazon.com/quicksight/latest/user/level-aware-aggregations.html
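As a language-agnostic illustration of the per-group calculation (not QuickSight syntax; the column names are assumed, since the original table is not shown), here is a pandas sketch of what the level-aware aggregation computes: aggregate to the team/month level first, then divide.

```python
import pandas as pd

# Assumed shape: one row per sales person per month.
df = pd.DataFrame({
    "month":             ["Jan", "Jan", "Feb", "Feb"],
    "team":              ["A", "A", "A", "A"],
    "phone_expenditure": [100, 80, 90, 70],
    "travel_allowance":  [50, 60, 40, 30],
    "misc_allowance":    [10, 0, 20, 10],
    "conversions":       [2, 0, 1, 3],   # one person has zero conversions
})

df["total_spend"] = (df["phone_expenditure"]
                     + df["travel_allowance"]
                     + df["misc_allowance"])

# Sum spend and conversions per team and month FIRST, then take the
# ratio, so an individual with zero conversions never divides by zero.
grouped = df.groupby(["team", "month"]).agg(
    spend=("total_spend", "sum"),
    conversions=("conversions", "sum"),
)
grouped["spend_per_conversion"] = grouped["spend"] / grouped["conversions"]
print(grouped)
```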

Which machine learning algorithm should I use for sequence prediction?

I have a dataset like the one below. The datetime column is the index, and type is a column containing a sequence. For example, R,C,D,D,D,R,R is one sequence.
start_time           type
2019-12-14 09:00:00  RCDDDRR
2019-12-14 10:00:00  CCRD
2019-12-14 11:00:00  DDRRCC
2019-12-14 12:00:00  ?
I want to predict what the next sequence would be at 12:00:00. Which algorithm is best for predicting the next sequence?
I know we can use a Markov chain to predict the most probable sequence. However, are there any better algorithms?
Thanks
You can use KNN or SVM for prediction, but first of all you have to reshape the data and define features for the training dataset.
You can also use another method based on deep learning; I think this link can help you:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
LSTMs have an edge over conventional feed-forward neural networks and plain RNNs in many ways, because of their ability to selectively remember patterns over long durations.
LSTMs make small modifications to the information through multiplications and additions; the information flows through a mechanism known as the cell state. This way, LSTMs can selectively remember or forget things. The information at a particular cell state has three different dependencies.
Let’s take the example of predicting stock prices for a particular stock. The stock price of today will depend upon:
The trend that the stock has been following in the previous days, maybe a downtrend or an uptrend.
The price of the stock on the previous day, because many traders compare the stock’s previous day price before buying it.
The factors that can affect the price of the stock for today. This can be a new company policy that is being criticized widely, or a drop in the company’s profit, or maybe an unexpected change in the senior leadership of the company.
These dependencies can be generalized to any problem as:
The previous cell state (i.e., the information that was present in the memory after the previous time step).
The previous hidden state (this is the same as the output of the previous cell).
The input at the current time step (i.e., the new information that is being fed in at that moment).
Maybe these links and methods could help you:
https://www.bioinf.jku.at/publications/older/2604.pdf
https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/
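For a concrete starting point, here is a minimal Keras sketch (assumptions: the event types are the characters R, C, D from the question; the window size, layer sizes, and training settings are arbitrary) that learns to predict the next event type from the previous few:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical

# The hourly sequences from the question, as character strings.
sequences = ["RCDDDRR", "CCRD", "DDRRCC"]
vocab = sorted(set("".join(sequences)))            # ['C', 'D', 'R']
char_to_idx = {c: i for i, c in enumerate(vocab)}

# Build (window -> next event) training pairs from every sequence.
window = 3
X, y = [], []
for seq in sequences:
    encoded = [char_to_idx[c] for c in seq]
    for i in range(len(encoded) - window):
        X.append(encoded[i:i + window])
        y.append(encoded[i + window])

X = to_categorical(X, num_classes=len(vocab))      # (samples, window, vocab)
y = to_categorical(y, num_classes=len(vocab))

model = Sequential([
    LSTM(32, input_shape=(window, len(vocab))),
    Dense(len(vocab), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=200, verbose=0)

# Predict the event most likely to follow the last three observed events.
last = to_categorical([[char_to_idx[c] for c in "RCC"]],
                      num_classes=len(vocab))
print(vocab[int(model.predict(last).argmax(axis=-1)[0])])
```

With so little data this will only memorize the toy sequences; the point is the input shaping (one-hot windows) and the softmax output over the event vocabulary.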

Multidimensional analysis in Hive/Impala

I have a denormalized table, say Sales, that looks like:
SalesKey,
SalesOfParts, SalesOfEquipments, CostOfSales as numeric measures,
Industry, Country, State, Sales area, Equipment id, customer id, year of sale, month of sale, and some more similar dimensions (12 dimensions in total).
I need to support aggregation queries on Sales, like the total number of sales in a year or month, their total cost, etc.
These aggregates also need to be filterable, e.g. total sales in 2013-04 belonging to the Manufacturing industry for customer XYZ.
I have these dimension tables and facts in hive/impala.
I do not think I can build a cube on all the dimensions. I read a paper on how to do OLAP over many dimensions:
http://www.vldb.org/conf/2004/RS14P1.PDF
It basically suggests materializing cubes over small fragments and doing some kind of runtime computation when a query spans multiple cubes.
I am not sure how to implement this model in Hive/Impala. Any pointers/suggestions would be awesome.
EDIT: I have about 10 million rows in the Sales table, and the number of dimensions is nowhere near 100 but around 12 (it might go up to 15), each with fairly high cardinality.
I would build cubes using third-party software. For example, icCube is an in-memory OLAP server that can handle 10 million rows over 12 dimensions with no issue at all; response times would be sub-second across all dimensions. Moving 10 million rows out of Hive does not seem to be an issue (you could use the JDBC driver for that purpose). icCube is specifically designed to handle high sparsity properly.

What dimensions to use for an "Incident" fact table

I want to build a fact table that holds information about incidents.
The dimensions I suggested are:
Time_Dimension : ID_Time, Year, Month, Day
Location_Dimension (City, for example): ID_City, name
But what I don't get is this: the data mart is supposed to hold information about incidents, and I've noticed in some DWH designs that Incident is also used as a dimension. So I ask myself, what would be the benefit of the other dimensions (i.e. the location dimension and the time dimension) if all the information in the fact table is already in the "Incident" dimension?
The measures to calculate are the "Cost of Incident" (per month) and the Number of Incidents (per month).
Having an incident dimension doesn't mean you would move the location and time into that dimension. An incident might have other attributes, like who owns it or what type it is; those belong in the incident dimension. If you have other things that tie to a location, then you are doing the right thing by tying your incident fact to the location dimension as well. And every fact should be tied to a date/time dimension.
It sounds like you are just getting started with dimensional modeling. You probably want to check out The Kimball Group to get a better understanding of how this works. The Data Warehouse Toolkit book has many good examples that would help you understand how to model your data.
