Calculate cumulative total in DAX? - dax

I'm calculating cumulative total in DAX like:
DEFINE MEASURE 'Sales'[Running Total] =
CALCULATE (
SUM('Sales'[Revenue]),
FILTER(ALL('Date'[Date]),'Date'[Date]<=MAX('Date'[Date]))
)
This should be well-estabilished pattern (at least it is referenced here: http://www.daxpatterns.com/cumulative-total/)
My problem is when I try to evaluate it like:
EVALUATE SUMMARIZECOLUMNS(
'Date'[Date],
"Total_Revenue_By_Date",
'Sales'[Running Total]
)
I'm running into error
The resultset of a query to external data source has exceeded the maximum allowed size of '1000000' rows.
I'm using tabular model with direct query. I know I can enlarge the limit, however the underlying tables are small - Date table has around 10000 rows, Sales table has around 10000 rows as well (it will be much larger on production), so something here doesn't scale well.
I have an idea how to get away with calculating running totals on SQL level, any idea how to tackle this on DAX level ?

Models created by Power BI desktop has default limit of 1 million rows.
This might help you,
https://www.sqlbi.com/articles/tuning-query-limits-for-directquery/

Related

Power BI Matrix Totals calculating incorrectly for

I am currently having an issue with the power bi Matrix Calculation when using dax. I need to calculate the Running total for overtime per fortnight, which I have achieved using the following
Lieu running total to Date = CALCULATE(SUM('Table1'[OT]),FILTER(ALLSELECTED('Calendar'[Date]),ISONORAFTER('Calendar'[Date], MAX('Calendar'[Date]), DESC)))
However I now need to calculate the excess hours (OT) which I have used the following to calculate people over 90 hours(additional 10 hours)
Excess Lieu2 = IF([Lieu running total in Date]>=160,[Lieu running total in Date]-160,0)
The issues is the the grand totals is calculating the entire total - 160
The the last few total rows as well as the grand total are aggregating incorrectly...ANy help is greatly appreciated. A Dax solution is needed as this will need to be dynamic as the employees names will be added
Add a new measure with below code and add that measure to your matrix.
Total_Lieu_running_total_to_Date =
SUMX (
SUMMARIZE (
table,//add the table from which the pay period end date is coming
table[pay period end date],//date column which you are using in matrix
"Lieu_running_total_to_Date",
[Lieu running total to Date] //measure which you using currently
),
[Lieu_running_total_to_Date]
)
Note:- Please add some sample data so it would help other to give you solution quickly.

performance for sum oracle

I have to sum a huge number of data with aggregation and where clause, using this query
what I am doing is like this : I have three tables one contains terms the second contains user terms , and the third contains correlation factor between term and user term.
I want to calculate the similarity between the sentence that that user inserted with an already existing sentences, and take the results greater than .5 by summing the correlation factor between sentences' terms
The problem is that this query takes more than 15 min. because I have huge tables
any suggestions to improve performance please?
insert into PLAG_SENTENCE_SIMILARITY
SELECT plag_TERMS.SENTENCE_ID ,plag_User_TERMS.SENTENCE_ID,
least( sum( plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_terms.sentence_length,
sum (plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_user_terms.sentence_length),
plag_TERMs.isn,
plag_user_terms.isn
FROM plag_TERM_CORRELATIONS3,
plag_TERMS,
Plag_User_TERMS
WHERE ( Plag_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM1
AND Plag_User_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM2
AND Plag_User_Terms.ISN=123)
having
least( sum( plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_terms.sentence_length,
sum (plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_user_terms.sentence_length) >0.5
group by (plag_User_TERMS.SENTENCE_ID,plag_TERMS.SENTENCE_ID , plag_TERMs.isn, plag_terms.sentence_length,plag_user_terms.sentence_length, plag_user_terms.isn);
plag_terms contains more than 50 million records and plag_correlations3 contains 500000
If you have a sufficient amount of free disk space, then create a materialized view
over the join of the three tables
fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle)
with query rewrite enabled
properly physically organized for quick calculations
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
As for the physical organization, consider
hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN
single-table hash clustering by Plag_User_Terms.ISN if you want to retain your calculation over a single ISN
If you don't have a spare disk space, then just hint your query to
either use nested loops joins, since the number of rows processed seems to be quite low (assumed by the estimations in the execution plan)
or full-scan the plag_correlations3 table in parallel
Bottom line: Constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! Because Oracle optimizer is capable of using most of these informations to its advantage, as are the people who tune SQL queries.

Poor Performance of Mondrian w/ Degenerate Dimensions

I have an application that collects performance metrics and stores them in a datamart. I then use Mondrian to enable analysis and ad-hoc exploration of the data. I'm collecting about 5e6 rows per day and total size of the METRIC table is about 300M rows.
We "color" our data based on the metrics comparison to an SLA. There are exactly 5 distinct values for color. When we do simple MDX queries to get, for example, a color distribution of the data for a specific date range, say 1 day, we see queries like below:
2014-06-11 23:17:08,042 DEBUG [sql] - 223: SqlTupleReader.readTuples
[[Color].[Color]]: executing sql [select "METRIC"."COLOR" as "c0"
from "METRIC" "METRIC" group by "METRIC"."COLOR" order by
"METRIC"."COLOR" ASC NULLS LAST] 2014-06-11 23:17:58,747 DEBUG [sql] -
223: , exec 50704 ms
In order to improve performance, the datamart includes aggregate tables at the hour and day levels, and both aggregate tables include the COLOR column.
I understand that Mondrian is very dependent on the underlying database performance, but there is really no way to tune this. I can create an index on COLOR (because a full scan of the index will be marginally faster than a full scan of the table), but it seems silly to create an index with 5 distinct value on a 300M row table. The day aggregate table has about 500K rows and would be significantly faster executing virtually the same query against this table, but Mondrian always seems to go to the base fact table for these dimension queries.
My question is, is there some way to avoid this query? If I can't avoid it, is it possible to get Mondrian to use the aggregate tables for this type of query? I have specified approxRowCount in the single level of this dimension/hierarchy and that eliminated the similar query to get the count of values. I haven't dug into the source of Mondrian yet to determine if there is a possibility of using the aggregate table or if there is some configuration on my part that is preventing it.
Edit for Clarification:
I probably didn't do a good job of asking my question-let me try and clarify. My MDX query looks something like:
select [Color].[Color].Members on columns,
{[Measures].[Metric Value], [Measures].[Count]} on rows
from [Metric]
where [Time].[2014].[June].[11]
I can look at this and hand write a SQL query that answers this query
select COLOR, avg(VALUE), sum(FACT_COUNT)
from AGG_DAY_METRIC
where YEAR = 2014
and MONTH = 6
and DAY_OF_MONTH = 11
group by COLOR
The database answers this query in about 100ms scanning approx 4K rows. It takes Mondrian several minutes to answer the
query because it does several queries that don't answer the MDX query directly, but rather get information about the
dimension. In the case above, the database has to scan 300M rows, taking 50 seconds, to return that there are 5 possible
colors. If color was in a normal dimension table there would only be 5 rows, but in a degenerate dimension there can be 100s
of millions of rows.
So my questions are:
a) Is there a way to tell Mondrian the values of a degenerate dimension and avoid these queries?
b) Is there a way to have Mondrian answer these queries from aggregate tables?
This problem was solved, not by modifying anything in the Mondrian schema or the application, but the database. The database in this case was Oracle and we were able to create a materialized view with query rewrite enabled.
The materialized view is created from the exact query issued by Mondrian. Since the color values don't change very frequently (almost never in our case), the materialized view does a full refresh once a day.
In this case the queries went from taking minute(s) to milliseconds. If your facing an issue like this and your database is Oracle this is a good approach to speeding up the tuples resolution for degenerate dimensions with low cardinality.
It's hard to give any specific directions without knowing more about your schema, but it looks to me you have to make sure that the number of rows with certain colours (count) has to be marked defined as an aggregate measure (Count or Max Number).
Please note that these aggregates are not calculated continuously (I think it would be to heavy for the backing data-store, and Mondrian won't keep a flowing set in memory for incoming facts).
The aggregation can be specified to be ran/rebuilt at specific times (nightly, hourly...). This would make Mondrian a bit unsuitable for real-time analysis, but you should be able to do almost instant queries on historical data.
If your dimension has 5 distinct values in a 300M fact table it should not be a degenerate dimension. It should be in a separate dimension table. A degenerate dimension should ONLY be used if its cardinality is close to the full fact table row count, making a separate table pointless, as there would be no significant storage savings and joining the dimension results in a lot of data being read;
If you put the colors on a separate dim table, any "Read Tuples" query will return results in a few ms, and your problem is solved.
However, more to the point of your question, Mondrian should be able to pick the dim values from the agg tables. Unless you have distinct-count aggregators in the cube, in which case you're in a tricky situation (unless there's an agg table that exactly matches the level of detail you need, Mondrian will very likely scan the fact table).
You should also set the highCardinality attribute of this degenerate dimension to True. Even with only 5 distinct values, having highCardinality=false tells Mondrian it's safe to scan the whole dimension to populate the list of members. Setting it to true stops this scan.
You should also add an index to this column. It's always a good idea to add indexes to every key and degenerate dimension column in a fact table. With an index the DB should answer much faster that SQL query.
Finally, you have a 300M row fact table. What DBMS are you using? Is it a Column oriented DB? If not, you should try them as a possible alternative to your data store. Column oriented DB have a significant performance increase over Row oriented DBs for Mondrian-like queries. There are a few good options out there, you should test drive them.

Oracle 11g - doing analytic functions on millions of rows

My application allows users to collect measurement data as part of an experiment, and needs to have the ability to report on all of the measurements ever taken.
Below is a very simplified version of the tables I have:
CREATE TABLE EXPERIMENTS(
EXPT_ID INT,
EXPT_NAME VARCHAR2(255 CHAR)
);
CREATE TABLE USERS(
USER_ID INT,
EXPT_ID INT
);
CREATE TABLE SAMPLES(
SAMPLE_ID INT,
USER_ID INT
);
CREATE TABLE MEASUREMENTS(
MEASUREMENT_ID INT,
SAMPLE_ID INT,
MEASUREMENT_PARAMETER_1 NUMBER,
MEASUREMENT_PARAMETER_2 NUMBER
);
In my database there are 2000 experiments, each of which has 18 users. Each user has 6 samples to measure, and would do 100 measurements per sample.
This means that there are 2000 * 18 * 6 * 100 = 21600000 measurements currently stored in the database.
I'm trying to write a query that will get the AVG() of measurement parameter 1 and 2 for each user - that would return about 36,000 rows.
The query I have is extremely slow - I've left it running for over 30 minutes and it doesn't come back with anything. My question is: is there an efficient way of getting the averages? And is it actually possible to get results back for this amount of data in a reasonable time, say 2 minutes? Or am I being unrealistic?
Here's (again a simplified version) the query I have:
SELECT
E.EXPT_ID,
U.USER_ID,
AVG(MEASUREMENT_PARAMETER_1) AS AVG_1,
AVG(MEASUREMENT_PARAMETER_2) AS AVG_2
FROM
EXPERIMENTS E,
USERS U,
SAMPLES S,
MEASUREMENTS M
WHERE
U.EXPT_ID = E.EXPT_ID
AND S.USER_ID = U.USER_ID
AND M.SAMPLE_ID = S.SAMPLE_ID
GROUP BY E.EXPT_ID, U.USER_ID
This will return a row for each expt_id/user_id combination and the average of the 2 measurement parameters.
For your query, in any case, the DBMS needs to read the complete measurements table. This is by far the biggest part of data to read, and the part which takes most time if the query is optimized well (will come to that later). That means that the minimum runtime of your query is about the time it takes to read the complete measurements table from whereever it is stored. You can get a rough estimate by checking how much data that is (in MB or GB) and checking how much time it would take to read this amount of data from the harddisk (or where the table is stored). If your query runs slower by a factor of 5 or more, you can be sure that there is room for optimization.
There is a vast amount of information (tutorials, individual hints which can be invaluable, and general practices lists) about how to optimize oracle queries. You will not get through all this information quickly. But if you provide the execution plan of your query (this is what oracle's query optimizer thinks is the best way to fulfill your query), we will be able to spot steps which can be optimized and suggest solutions.

How to predict table sizes Oracle?

I'm trying to do a growth prediction on some tables I have and for that I've got to do some calculations on my row sizes, how many rows I generate by day and well.. the maths.
I'm calculating the average size of each row in my table as the sum of the average size of each field. So basicaly:
SELECT 'COL1' , avg(vsize(COL1)) FROM TABLE union
SELECT 'COL2' , avg(vsize(COL2)) FROM TABLE
Sum that up, multiply by the number of entries of a day and work the predictions from there.
Turns out that for one of the tables I've looked the resulting size is a lot smaller than I thought it would be and got me wondering if my method was right.
Also, I did not consider indexes sizes for my predictions - and of course I should.
My questions are:
Is this method I'm using reliable?
Tips on how could I work the predictions for the Indexes?
I've done my googling, but the methods I find are all about the segments and extends or else calculations based in the whole table. I will need the step with the actual row of my table to do the predictions (I have to analyse the data in the table in order to figure how many records a day).
And finally, this is an approximation. I know I'm missing some bytes here and there with overheads and stuff. I just want to make sure I'm only missing bytes and not gigas :)
1) Your method is sound to calculate the average size of a row. (Though be aware that if your column contains null, you should use avg(nvl(vsize(col1), 0)) instead of avg(vsize(COL1))). However, it doesn't take into account the physical arrangement of rows.
First of all, it doesn't take into account the header info (from both blocks and rows): you can't fit 8k data into 8k blocks. See the documentation on data block format for more information.
Then, rows are not always stored neatly packed. Oracle lets some space in each blocks so that the rows can grow when they are updated (governed by the pctfree parameter). Also when the rows are deleted the empty space is not reclaimed right away (if you're not using ASSM with locally managed tablespaces, the amount of free space required for a block to return to the list of available blocks depends on pctused).
If you already have some representative data in your table, you can estimate the amount of extra space you will need by comparing the space physically used (all_tables.blocks*block_size after having gathered statistics) to the average row length.
By the way Oracle can easily give you a good estimate of the average row length: gather statistics on the table and query all_tables.avg_row_len.
2) Most of the time (read: unless there is a bug or you fall into an atypical use of the index), the index will grow proportionaly to the number of rows.
If you have representative data, you can have a good estimation of its future size by multiplying its actual size by the relative growth of the number of rows.
The last time Oracle published their formulae for estimating the size of schema objects was in Oracle 8.0, which means the linked document is ten years out of date. However, I don't expect very much has changed in how Oracle reserves segment header, block header, or row header information.

Resources