Where does the DAX MTD calc go?

The stack is SQL Server relational tables feeding a SQL Server 2014 Analysis Services Tabular model, consumed by Excel 2010.
The Tabular model's grain is one row per purchase order (PO) line item. Each row has a dollar value (item cost$) which feeds a sum measure (total cost$).
A Time Intelligence Date table is related to it, so the sum of total cost$ for a year, for example, can be determined.
How best to implement a month-to-date aggregate? Should a DAX query against the model calculate the MTD on the fly as I pull the data into Excel, or is there a way to implement it directly in the model at the PO line item grain?

I would personally put a calculated measure in the model itself using TOTALMTD(), though you could always just use SQL and do it in the back end instead. Calculated measures are generally pretty efficient; your model would have to be huge before you started seeing performance issues with them, so I wouldn't worry too much.
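For the back-end alternative mentioned above, a month-to-date running total can be computed at the PO line grain with a windowed SUM in T-SQL. This is only a minimal sketch; the table and column names (dbo.PoLineFact, PoLineId, ItemCost, PoDate) are assumptions, not the actual schema.

-- Sketch: MTD running total per PO line item, computed on the SQL Server side.
-- Table and column names below are assumed for illustration only.
SELECT
    f.PoLineId,
    f.PoDate,
    f.ItemCost,
    SUM(f.ItemCost) OVER (
        PARTITION BY YEAR(f.PoDate), MONTH(f.PoDate)        -- restart the total each month
        ORDER BY f.PoDate
        RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   -- rows on the same date share one MTD value
    ) AS MtdCost
FROM dbo.PoLineFact AS f;

In the model itself, the TOTALMTD() approach described above keeps this logic in a single measure instead of an extra column or query.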

Related

Slowly changing dimension join performance

General overview: I have an Oracle table 'product' that contains approximately 80 million records, and I would like to improve the performance of joins that use this table. In most cases we are interested in a very small subset of records from the 'product' table whose 'valid_until' date column has the value 'mm/dd/9999'.
Possible solutions:
Partition on 'valid_until' so the 'mm/dd/9999' rows sit in their own partition, and use partition exchange to load new data quickly.
Use an index on 'valid_until' date.
Do you guys have any other possible Oracle solutions or ideas?
Based on needing to find 1% of records, I would expect an index to be adequate. It might pay to include the table's PK in the index as well if the query only needs to find the PK values for the current products.
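As a rough sketch of that suggestion (the PK column name product_id is a guess, not taken from the actual schema):

-- Composite index: filter on valid_until and carry the PK so that PK-only
-- lookups for current products can be answered from the index alone.
-- The column name product_id is assumed for illustration.
CREATE INDEX product_valid_until_idx
    ON product (valid_until, product_id);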
If there is no need to identify records by other valid_until dates, then it might be worth using Oracle's equivalent of a partial index (a function-based index whose expression is NULL for every row you do not care about, so those rows get no index entries) by indexing on:
case valid_until
  when date '...whatever the date is...'
  then valid_until
  else null
end
... but that would mean changing the schema or the tool that generates the queries or both.
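For illustration only, the full index plus a query whose predicate matches it might look like the following; the sentinel date 9999-12-31 is an assumption standing in for whatever value the 'mm/dd/9999' rows actually hold.

-- Sketch: function-based index that only has entries for "current" rows.
-- The sentinel date below is assumed, not taken from the actual data.
CREATE INDEX product_current_fbi ON product (
    CASE valid_until
        WHEN DATE '9999-12-31' THEN valid_until
        ELSE NULL
    END
);

-- The query (or the tool that generates it) must repeat the exact same
-- expression, otherwise the index cannot be used.
SELECT *
  FROM product
 WHERE CASE valid_until
           WHEN DATE '9999-12-31' THEN valid_until
           ELSE NULL
       END = DATE '9999-12-31';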
You might also keep an eye on the table's statistics to make sure the optimizer's estimate of the cardinality of the selected rows stays reasonably accurate.
I wouldn't go for a partition-based solution as a first choice, as the overhead of row-migration during the update of the valid_until values would be fairly high, but if an index cannot deliver the query performance then by all means try.

SSAS Performance: Multiple measures+no Dim vs one measure+DimType

I am building a finance cube and trying to understand the best practice while designing my main fact table.
Which do you think would be the better solution?
1. Have one column in the fact table (Amount) and an additional field indicating the type of financial transaction (costs, income, tax, refund, etc.):
TransType  Amount  Date
Costs      10      Aug-1
Income     15      Aug-1
Refunds    5       Aug-2
Costs      5       Aug-2
2. "Pivot" the table to create several columns according to the type of the transaction:
Costs  Income  Refund  Date
10     15      NULL    Aug-1
5      NULL    5       Aug-2
Of course, the cube will follow whichever option is selected: either several real measures, or several calculated measures, each based on the one main measure and sliced on a member of a "Transaction Type" dimension.
(In general, all transaction types have the same number of rows.)
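For what it's worth, the single-Amount-plus-TransType layout can still produce the per-type columns of the pivoted layout with conditional aggregation, which is roughly what the per-type calculated measures would compute. A minimal SQL sketch, assuming a fact table named FactFinance (the name is illustrative only):

-- Sketch: reproduce the pivoted per-type columns from the single-measure
-- layout. The table name FactFinance is assumed for illustration.
SELECT
    [Date],
    SUM(CASE WHEN TransType = 'Costs'   THEN Amount END) AS Costs,
    SUM(CASE WHEN TransType = 'Income'  THEN Amount END) AS Income,
    SUM(CASE WHEN TransType = 'Refunds' THEN Amount END) AS Refunds
FROM FactFinance
GROUP BY [Date];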
Thank you in advance.
Oren.
For a finance-related cube, I believe it is much better to use the account dimension functionality.
By using an account dimension, you can add or remove accounts without changing the structure of your model. Also, if you use an account dimension, the cube's time balance (aggregate function) functionality can help you a lot.
However, the SSAS account dimension has its own problems as well. For example, if you assign a time balance to a formula or a hierarchical parent, it is silently ignored, and as far as I know that is not documented. So be ready to fix the calculations in the calculation script.
You can also use custom rollup member functionality to load your financial formulas.
In our case, we have 6000+ accounts, and the formulas can change outside our control, so having the custom rollup member functionality helps a lot.
You need to be careful with solve orders (ratios, etc.), but that is usual for any complicated financial cube.

Multidimensional analysis in Hive/Impala

I have a denormalized table, say Sales, that looks like:
SalesKey
SalesOfParts, SalesOfEquipments, CostOfSales as numeric measures
Industry, Country, State, Sales area, Equipment id, Customer id, Year of sale, Month of sale and some more similar dimensions (12 dimensions in total)
I need to support aggregation queries on Sales, such as the total number of sales in a year or month, their total cost, and so on.
These aggregates also need to be filtered, e.g. something like total sales in 2013-04 for the Manufacturing industry and customer XYZ.
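A minimal sketch of that kind of query in HiveQL, assuming underscore-separated column names derived from the description above (the exact names are assumptions):

-- Sketch: total sales and cost for April 2013, Manufacturing industry,
-- customer XYZ. Dimension column names are assumed from the description.
SELECT
    SUM(SalesOfParts)      AS total_parts_sales,
    SUM(SalesOfEquipments) AS total_equipment_sales,
    SUM(CostOfSales)       AS total_cost
FROM Sales
WHERE year_of_sale  = 2013
  AND month_of_sale = 4
  AND industry      = 'Manufacturing'
  AND customer_id   = 'XYZ';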
I have these dimension tables and facts in Hive/Impala.
I do not think I can build a cube on all the dimensions. I read a paper on how to do OLAP over many dimensions:
http://www.vldb.org/conf/2004/RS14P1.PDF
It basically suggests materializing cubes over small fragments of the dimensions and doing some runtime computation when a query spans multiple fragments.
I am not sure how to implement this model in Hive/Impala. Any pointers/suggestions will be awesome.
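One way to approximate that idea in plain HiveQL is to pre-aggregate a few small groups of dimensions with GROUPING SETS and answer queries from whichever fragment covers the requested dimensions. This is only a sketch of the idea, not something the paper or Hive prescribes; table and column names are assumed as above, and Impala support for GROUPING SETS depends on the version.

-- Sketch: materialize one cube "fragment" over a small subset of dimensions.
-- Column names are assumptions based on the description above.
CREATE TABLE sales_frag_industry_geo_time AS
SELECT
    industry, country, state, year_of_sale, month_of_sale,
    SUM(SalesOfParts)      AS total_parts_sales,
    SUM(SalesOfEquipments) AS total_equipment_sales,
    SUM(CostOfSales)       AS total_cost
FROM Sales
GROUP BY industry, country, state, year_of_sale, month_of_sale
GROUPING SETS (
    (industry, year_of_sale, month_of_sale),
    (country, state, year_of_sale, month_of_sale),
    (industry, country, state, year_of_sale, month_of_sale)
);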
EDIT: I have about 10 million rows in the Sales table, and the number of dimensions is nowhere near 100; it is around 12 (it might go up to 15), but each has good cardinality.
I would build the cubes using third-party software. For example, icCube is an in-memory OLAP server that can handle 10 million rows over 12 dimensions with no issue at all; response times would then be sub-second across all dimensions. Moving 10 million rows out of Hive does not seem to be an issue (you could use the JDBC driver for that purpose). icCube is specifically designed to handle high sparsity properly.

Oracle Date Range Inconsistency

I am running a fairly large query on a specific range of dates. The query takes about 30 seconds EXCEPT when I do a range of 10/01/2011-10/31/2011. For some reason that range never finishes. For example 01/01/2011-01/31/2011, and pretty much every other range, finishes in the expected time.
Also, I noticed that doing smaller ranges, like a week, takes longer than larger ranges.
When Oracle gathers statistics on a table, it will record the low value and the high value in a date column and use that to estimate the cardinality of a predicate. If you create a histogram on the column, it will gather more detailed information about the distribution of data within the column. Otherwise, Oracle's cost based optimizer (CBO) will assume a uniform distribution.
For example, if you have a table with 1 million rows and a DATE column with a low value of January 1, 2001 and a high value of January 1, 2011, it will assume that approximately 10% of the data is in the range January 1, 2001 - January 1, 2002, and that roughly 0.027% of the data (1 / (10 years * 365 days per year + 2 leap days) = 1/3652) would come from some time on March 3, 2008.
So long as your queries use dates from within the known range, the optimizer's cardinality estimates are generally pretty good so its decisions about what plan to use are pretty good. If you go a bit beyond the upper or lower bound, the estimates are still pretty good because the optimizer assumes that there probably is data that is larger or smaller than it saw when it sampled the data to gather the statistics. But when you get too far from the range that the optimizer statistics expect to see, the optimizer's cardinality estimates get too far out of line and it eventually chooses a bad plan. In your case, prior to refreshing the statistics, the maximum value the optimizer was expecting was probably September 25 or 26, 2011. When your query looked for data for the month of October, 2011, the optimizer most likely expected that the query would return very few rows and chose a plan that was optimized for that scenario rather than for the larger number of rows that were actually returned. That caused the plan to be much worse given the actual volume of data that was returned.
In Oracle 10.2, when Oracle does a hard parse and generates a plan for a query that is loaded into the shared pool, it peeks at the bind variable values and uses those values to estimate the number of rows a query will return and thus the most efficient query plan. Once a query plan has been created and until the plan is aged out of the shared pool, subsequent executions of the same query will use the same query plan regardless of the values of the bind variables. Of course, the next time the query has to be hard parsed because the plan was aged out, Oracle will peek and will likely see new bind variable values.
Bind variable peeking is not a particularly well-loved feature (Adaptive Cursor Sharing in 11g is much better) because it makes it very difficult for a DBA or a developer to predict what plan is going to be used at any particular instant because you're never sure if the bind variable values that the optimizer saw during the hard parse are representative of the bind variable values you generally see. For example, if you are searching over a 1 day range, an index scan would almost certainly be more efficient. If you're searching over a 5 year range, a table scan would almost certainly be more efficient. But you end up using whatever plan was chosen during the hard parse.
Most likely, you can resolve the problem simply by ensuring that statistics are gathered more frequently on tables that are frequently queried based on ranges of monotonically increasing values (date columns being by far the most common such column). In your case, it had been roughly 6 weeks since statistics had been gathered before the problem arose so it would probably be safe to ensure that statistics are gathered on this table every month or every couple weeks depending on how costly it is to gather statistics.
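A minimal sketch of that kind of scheduled refresh, assuming the table is called ORDERS and is owned by APP_OWNER (both names are placeholders, not taken from the question):

-- Sketch: refresh optimizer statistics so the known high value of the date
-- column keeps up with newly loaded data. Schema and table names are assumed.
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
        ownname => 'APP_OWNER',   -- assumed schema
        tabname => 'ORDERS',      -- assumed table
        cascade => TRUE);         -- also refresh index statistics
END;
/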
You could also use the DBMS_STATS.SET_COLUMN_STATS procedure to explicitly set the statistics for this column on a more regular basis. That requires more coding and work but saves you the time of gathering statistics. That can be hugely beneficial in a data warehouse environment but it's probably overkill in a more normal OLTP environment.

How do I sort an Excel 2010 pivot table based on a subset of the data it contains?

I have an Excel 2010 pivot table that has categories and a count measure as the data. Those categories then have a date dimension nested underneath, filtered to show only the last two months.
When I sort the categories, I am sorting them by the total of the count measure across both June and July, in descending order.
Can anyone suggest how I can sort the categories based on the June data alone, as opposed to the total for both June and July?
Thanks!
Your question is not related to SQL Server Analysis Services. SSAS provides multidimensional data that pivot tables also use as a data source, which is why you have seen pivot table questions here; but they are not Excel-related.
Anyway, I want to try to answer your question. As far as I understand it, changing the order of the dimensions in your pivot table will be sufficient to achieve your goal: add the date dimension to the pivot table first and then the category dimension, so that your data is grouped by date (month). You can then sort by category to get the result you want.
Hope this helps.
