How to create a tooltip measure in Power BI that displays a moving sum of values?

I have a bar chart that shows values per week. I created a measure in a tooltip that shows the sum of these values per week regardless of the week filter. When I'm slicing the period by week with a different week-slicer, I would like this sum to change according to the selected period. See images on how it now works:
[screenshots: week 1, week 2, week 3]
I used this measure for the tooltip:
sum = CALCULATE(SUM(table[value]), ALLSELECTED('Date'), 'Date'[Week] <= MAX('Date'[Week]))
This works, so the weeks on the x-axis no longer filter the sum in the tooltip. However, I also have a slicer on this page that filters per week. When I select a week period (e.g. week 2 to week 3), this does not affect the sum. I would like to see week 2 = 134 and week 3 = 134 + 22 = 156. Can anyone help me make this time selection affect the summation?
I tried adding a MIN() condition, but this doesn't work: the result is that the measure no longer sums values regardless of the weeks shown in the visual:
sum = CALCULATE(SUM(table[value]), ALLSELECTED('Date'), 'Date'[Week] <= MAX('Date'[Week]), 'Date'[Week] >= MIN('Date'[Week]))
Thanks!
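To pin down the expected behaviour, here is a minimal Python sketch (not DAX) of the running total described above, using the numbers from the question; week 1's value is made up for illustration.

# Minimal sketch (Python, not DAX) of the desired behaviour: a running total
# that only accumulates the weeks currently selected by the slicer.
def running_total_within_selection(values_by_week, selected_weeks):
    """Return {week: cumulative sum over selected weeks up to and including it}."""
    total, result = 0, {}
    for week in sorted(selected_weeks):
        total += values_by_week.get(week, 0)
        result[week] = total
    return result

values_by_week = {1: 50, 2: 134, 3: 22}   # week 1's value (50) is invented for the example
print(running_total_within_selection(values_by_week, {2, 3}))
# {2: 134, 3: 156}  -> week 2 = 134 and week 3 = 134 + 22 = 156, as desired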

Related

Using the first row in bin (instead of average) to calculate percentage gain

In the dc.js Nasdaq example, percentageGain is calculated as:
(p.absGain / p.avgIndex) * 100
Here avgIndex is the average of all the day-averages.
I'm more familiar with the equation:
A. (Price - Prev period's Close) / Prev period's Close * 100
I'm not sure whether this is possible (with filters set and so on) the way crossfilter/dc works. Therefore, an alternative equation that might fit crossfilter/dc better and would still be meaningful could be:
B. absGain of group / open of first day of group * 100
B would also mean that if a filter is set on, for example, Q1, then only the absGain of Q1 is taken into account. The first day in this group is the oldest Q1 date in the oldest year. Also, charts other than "yearly", with groups like quarter, month or day of the week, should be able to display the value of this equation. For example, in a month chart the value of the month "June" is calculated by taking the open of the first day in the first June, while the absGain is taken from all June months (of course working with all current filters in place).
Question: Can A and/or B be solved the crossfilter/dc way and how (example)?
Even if only B could be solved (naturally with crossfilter/dc), that would already be great. I want to use the dc.js example for other stocks that have the same underlying data structure (open, close, high, low, volume)
thanks!
I agree that Equation B is easier to define using crossfilter, so I figured out one way to do it.
Equation A could probably work but it's unclear which day's close should be used under filtering - the last day which is not in the current bin? The day before the first day in the current bin?
Equation B needs the earliest row for the current bin, and that requires maintaining the array of all rows for each bin. This is not built into crossfilter but it's a feature which we have talked about adding.
The complex reduce example does this, and we can reuse some of its code. It calculates the median/mode/min/max value from the arrays of rows which fall in each bin, using these functions to generate those arrays:
function groupArrayAdd(keyfn) {
    var bisect = d3.bisector(keyfn);
    return function(elements, item) {
        var pos = bisect.right(elements, keyfn(item));
        elements.splice(pos, 0, item);
        return elements;
    };
}

function groupArrayRemove(keyfn) {
    var bisect = d3.bisector(keyfn);
    return function(elements, item) {
        var pos = bisect.left(elements, keyfn(item));
        if (keyfn(elements[pos]) === keyfn(item))
            elements.splice(pos, 1);
        return elements;
    };
}
It's somewhat inefficient to maintain all these arrays, so you might test if it has an impact on your application. JS is pretty fast so it probably doesn't matter unless you have a lot of data.
Unfortunately there is no way to compute the minimum for a bin other than to keep an array of all the items in it. (If you tried to keep track of just the lowest item, or the lowest N items, what would you do when they are removed?)
Using these arrays inside the group reduce-add function:
(p, v) => {
    ++p.count;
    p.rowsByDate = rbdAdd(p.rowsByDate, v);
    p.absGain += v.close - v.open;
    // ...
    p.percentageGain = p.rowsByDate.length ? (p.absGain / p.rowsByDate[0].open) * 100 : 0;
    return p;
},
In the reduce-remove function it's
p.rowsByDate = rbdRemove(p.rowsByDate, v);
and the same percentageGain change.
Here is a demo: https://jsfiddle.net/gordonwoodhull/08bzcd4y/17/
I only see slight changes in the Y positions of the bubbles; the changes are more apparent in the values printed in the tooltip.
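For readers who prefer to see the bookkeeping outside JavaScript, here is a hedged Python sketch of the same idea: each bin keeps its rows sorted by date, so the earliest row's open is always rows[0]. The class and function names and the tuple layout are illustrative, not part of crossfilter.

import bisect
from dataclasses import dataclass, field

@dataclass
class Bin:
    rows: list = field(default_factory=list)   # (date, open, close) tuples, kept sorted by date
    abs_gain: float = 0.0
    percentage_gain: float = 0.0

def reduce_add(p, v):
    date, open_, close = v
    bisect.insort(p.rows, (date, open_, close))   # insert while keeping date order
    p.abs_gain += close - open_
    p.percentage_gain = (p.abs_gain / p.rows[0][1]) * 100 if p.rows else 0.0
    return p

def reduce_remove(p, v):
    date, open_, close = v
    p.rows.remove((date, open_, close))           # drop the filtered-out row
    p.abs_gain -= close - open_
    p.percentage_gain = (p.abs_gain / p.rows[0][1]) * 100 if p.rows else 0.0
    return p

b = reduce_add(Bin(), ("2024-01-02", 100.0, 103.0))
b = reduce_add(b, ("2024-01-03", 103.0, 101.0))
print(b.abs_gain, b.percentage_gain)   # absGain = 1.0, gain relative to the first day's open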

Calculating Moving Average for N Months in DAX Power BI

I have a measure that calculates Moving Average for 3 months:
Moving_Avg_3_Months = AVERAGEX(DATESINPERIOD('Calendar FY'[Date],
LASTDATE('Calendar FY'[Date]), -3, MONTH),[CUS Revenue Credible All])
Is it possible to create a measure that calculates the moving average of [CUS Revenue Credible All] for N months, where N = 3, N = 6, or whatever number I'd like?
If you create a new table with the different values for the moving average you want to use, e.g. TableMovingAverage: [-3, -6, -12, -24, ..., N], and modify your DAX formula like this:
Moving_Avg_3_Months =
AVERAGEX(
    DATESINPERIOD('Calendar FY'[Date],
        LASTDATE('Calendar FY'[Date]),
        SELECTEDVALUE('TableMovingAverage'[Offset], -3),  -- SELECTEDVALUE needs a column; [Offset] is an assumed name
        MONTH),
    [CUS Revenue Credible All])
SELECTEDVALUE returns a scalar if exactly one value is selected in the specified column; otherwise it returns the default value, -3 in this case.
If you filter TableMovingAverage you can switch between the different moving averages.
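As a sanity check of what the parameterized measure computes, here is a small Python sketch of a trailing N-period average, where N plays the role of the value selected from TableMovingAverage; the revenue numbers are made up.

# Trailing N-period average; N is the "parameter" chosen at query time.
def moving_average(values, n):
    """Early positions average over whatever history is available."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

monthly_revenue = [100.0, 120.0, 90.0, 110.0, 130.0, 150.0]   # made-up figures
print(moving_average(monthly_revenue, 3))   # N = 3 -> 3-month moving average
print(moving_average(monthly_revenue, 6))   # N = 6 -> 6-month moving average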

Microsoft Excel cumulative calculation performance

I know 2 ways to calculate cumulative values in Excel.
1st method:
    A (Value)   B (Cumulative total)
1   9           =A1
2   8           =B1+A2
3   7           =B2+A3
4   6           =B3+A4
2nd method:
    A (Value)   B (Cumulative total)
1   9           =SUM($A$1:A1)
2   8           =SUM($A$1:A2)
3   7           =SUM($A$1:A3)
4   6           =SUM($A$1:A4)
2 questions:
Which method has better performance when the data set gets really big (say 100k rows)? The 1st method seems to have less overhead, because when a new value is added in column A (Value), the new cell in column B only needs to compute B(n-1)+A(n). In the 2nd method, is Excel smart enough to do something similar, or will it sum 100k cells from A1 to A(n) again?
What's the best way to calculate cumulative values? I found the 2nd method is more popular, though I doubt its performance. The only upside of the 2nd method I can see now is that the formulas in column B are more consistent; in the 1st method the first cell in column B has to be handled differently.
I tested both variants with the number sequence 9, 8, 7, 6, -9, -8, -7, -6; the workbook was set to manual calculation, triggered by the following code:
Sub ManualCalc()
    Dim R As Range
    Set R = Selection
    [F1] = Now()
    R.Worksheet.Calculate
    [F2] = Now()
    [F3] = ([F2] - [F1]) * 86400
End Sub
At 4096 rows the calculation time is not measurable for either variant (0 seconds). At 65536 rows your 1st method is still not measurable, while your 2nd method takes a bit less than 8 seconds on my laptop (Dell Latitude E6420, Win7, Office 2010; average of 3 measurements each). For a high number of rows I would therefore prefer method 1.
Regarding your Q1 ... yes, it would add 100k sums over ever-growing ranges ... Excel is not supposed to be smart, it's supposed to calculate whatever you ask it to calculate. If it were smart, it would have to interpret the intention of a set of formulas at runtime, which I'd regard as very dangerous!
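To make the overhead argument concrete, here is a rough Python sketch of the two strategies. It only illustrates the O(n) vs O(n^2) difference in work; whether Excel optimizes the growing SUM ranges is exactly what the answer above addresses.

# Method 1 analogue: reuse the previous cell, O(n) work overall.
def running_total(values):
    out, acc = [], 0
    for v in values:
        acc += v
        out.append(acc)
    return out

# Method 2 analogue: re-sum A1:A(n) for every row, O(n^2) work overall.
def resummed_total(values):
    return [sum(values[:i + 1]) for i in range(len(values))]

data = [9, 8, 7, 6, -9, -8, -7, -6] * 512   # 4096 rows, as in the test above
assert running_total(data) == resummed_total(data)   # same result, very different cost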

How to decide on weights?

For my work, I need some kind of algorithm with the following input and output:
Input: a set of dates (from the past). Output: a set of weights - one weight per one given date (the sum of all weights = 1).
The basic idea is that the closest date to today's date should receive the highest weight, the second closest date will get the second highest weight, and so on...
Any ideas?
Thanks in advance!
First, for each date in your input set assign the amount of time between the date and today.
For example: the following date set {today, tomorrow, yesterday, a week from today} becomes {0, 1, 1, 7}. Formally: val[i] = abs(today - date[i]).
Second, invert the values so that their relative order is reversed (the closest date gets the largest value). The simplest way of doing so would be: val[i] = 1/val[i].
Other suggestions:
val[i] = 1/val[i]^2
val[i] = 1/sqrt(val[i])
val[i] = 1/log(val[i])
The hardest and most important part is deciding how to invert the values. Think about what the nature of the weights should be: do you want noticeable differences between two far-away dates, or should two far-away dates have pretty equal weights? Do you want a date which is very close to today to have a much bigger weight or only a moderately bigger weight?
Note that you should come up with an inverting procedure that cannot divide by zero. In the example above, today's value is 0, so dividing by val[i] results in a division by zero. One method to avoid this is smoothing. The most trivial way to "smooth" your data is add-one smoothing, where you just add one to each value (so today becomes 1, tomorrow becomes 2, a week from today becomes 8, etc.).
Now the easiest part is to normalize the values so that they'll sum up to one.
sum = val[1] + val[2] + ... + val[n]
weight[i] = val[i]/sum for each i
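A short Python sketch of the whole recipe (distance from today, add-one smoothing, inversion, normalization); the dates and the choice of 1/val as the inverting function are just examples.

from datetime import date

def date_weights(dates, today=None, invert=lambda v: 1.0 / v):
    today = today or date.today()
    vals = [abs((today - d).days) + 1 for d in dates]   # +1 = add-one smoothing
    inv = [invert(v) for v in vals]                     # closer dates get larger values
    total = sum(inv)
    return [w / total for w in inv]                     # normalize so the weights sum to 1

dates = [date(2024, 1, 1), date(2024, 1, 7), date(2024, 1, 14)]   # arbitrary examples
print(date_weights(dates, today=date(2024, 1, 15)))   # closest date gets the largest weight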
Sort dates and remove dups
Assign values (maybe starting from the farthest date in steps of 10 or whatever you need - these values can be arbitrary, they just reflect order and distance)
Normalize weights to add up to 1
Executable pseudocode (tweakable):
#!/usr/bin/env python
import random, pprint
from operator import itemgetter
# for simplicity's sake dates are integers here ...
pivot_date = 1000
past_dates = set(random.sample(range(1, pivot_date), 5))
weights, stepping = [], 10
for date in sorted(past_dates):
    weights.append((date, stepping))
    stepping += 10
sum_of_steppings = sum([ itemgetter(1)(x) for x in weights ])
normalized = [ (d, (w / float(sum_of_steppings)) ) for d, w in weights ]
pprint.pprint(normalized)
# Example output
# The 'date' closest to 1000 (here: 889) has the highest weight,
# 703 the second highest, and so forth ...
# [(151, 0.06666666666666667),
# (425, 0.13333333333333333),
# (571, 0.2),
# (703, 0.26666666666666666),
# (889, 0.3333333333333333)]
How to weight: just compute the difference between each date and the current date
x(i) = abs(date(i) - current_date)
you can then use different expressions to assign weights:
w(i) = 1/x(i)
w(i) = exp(-x(i))
w(i) = exp(-x(i)^2)
or use a Gaussian distribution - more complicated, not recommended.
Then use normalized weights: w(i)/sum(w(i)) so that the sum is 1.
(Note that the exponential function is commonly used by statisticians in survival analysis.)
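A quick Python sketch of the exponential variant with the normalization applied; the example dates are arbitrary.

import math
from datetime import date

def exp_weights(dates, today):
    x = [abs((today - d).days) for d in dates]   # x(i) = abs(date(i) - current_date)
    w = [math.exp(-xi) for xi in x]              # w(i) = exp(-x(i))
    total = sum(w)
    return [wi / total for wi in w]              # normalize so the weights sum to 1

print(exp_weights([date(2024, 1, 10), date(2024, 1, 14)], today=date(2024, 1, 15)))
# the date one day away dominates the one five days away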
The first thing that comes to my mind is to use a geometric series:
http://en.wikipedia.org/wiki/Geometric_series
(1/2)+(1/4)+(1/8)+(1/16)+(1/32)+(1/64)+(1/128)+(1/256)..... sums to one.
Yesterday would be 1/2
2 days ago would be 1/4
and so on
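A tiny Python sketch of the geometric-series weighting; since a finite set of n dates only sums to 1 - (1/2)^n, the sketch renormalizes at the end so the weights sum to exactly 1.

# Weight 1/2, 1/4, 1/8, ... by recency, then renormalize for a finite set of dates.
def geometric_weights(dates_sorted_newest_first):
    raw = [0.5 ** (i + 1) for i in range(len(dates_sorted_newest_first))]
    total = sum(raw)
    return [r / total for r in raw]

print(geometric_weights(["yesterday", "2 days ago", "3 days ago"]))
# [0.571..., 0.285..., 0.142...]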
i is the index of the i-th date.
D0 is the first date.
Ni is the difference in days between the i-th date and the first date D0.
D is the normalization factor.
Assign weights equal to Ni / D.
Convert the dates to yyyymmddhhmiss format (24-hour clock), add up all these values to get a total, divide each value by that total, and sort by this value:
declare @Data table
(
    Date bigint,
    Weight float
)

declare @sumTotal decimal(18,2)

insert into @Data (Date)
select top 100
    replace(replace(replace(convert(varchar, Datetime, 20), '-', ''), ':', ''), ' ', '')
from Dates

select @sumTotal = sum(Date)
from @Data

update @Data set
    Weight = Date / @sumTotal

select * from @Data order by 2 desc

Algorithm for nice graph labels for time/date axis?

I'm looking for a "nice numbers" algorithm for determining the labels on a date/time value axis. I'm familiar with Paul Heckbert's Nice Numbers algorithm.
I have a plot that displays time/date on the X axis and the user can zoom in and look at a smaller time frame. I'm looking for an algorithm that picks nice dates to display on the ticks.
For example:
Looking at a day or so: 1/1 12:00, 1/1 4:00, 1/1 8:00...
Looking at a week: 1/1, 1/2, 1/3...
Looking at a month: 1/09, 2/09, 3/09...
The nice label ticks don't need to correspond to the first visible point, but close to it.
Is anybody familiar with such an algorithm?
The 'nice numbers' article you linked to mentioned that
the nicest numbers in decimal are 1, 2, 5 and all power-of-10 multiples of these numbers
So I think for doing something similar with date/time you need to start by similarly breaking down the component pieces. So take the nice factors of each type of interval:
If you're showing seconds or minutes use 1, 2, 3, 5, 10, 15, 30
(I skipped 6, 12 and 20 because they don't "feel" right).
If you're showing hours use 1, 2, 3, 4, 6, 8, 12
for days use 1, 2, 7
for weeks use 1, 2, 4 (13 and 26 fit the model but seem too odd to me)
for months use 1, 2, 3, 4, 6
for years use 1, 2, 5 and power-of-10 multiples
Now obviously this starts to break down as you get into larger amounts. Certainly you don't want to do show 5 weeks worth of minutes, even in "pretty" intervals of 30 minutes or something. On the other hand, when you only have 48 hours worth, you don't want to show 1 day intervals. The trick as you have already pointed out is finding decent transition points.
Just on a hunch, I would say a reasonable crossover point would be about twice as much as the next interval. That would give you the following (min and max number of intervals shown afterwards)
use seconds if you have less than 2 minutes worth (1-120)
use minutes if you have less than 2 hours worth (2-120)
use hours if you have less than 2 days worth (2-48)
use days if you have less than 2 weeks worth (2-14)
use weeks if you have less than 2 months worth (2-8/9)
use months if you have less than 2 years worth (2-24)
otherwise use years (although you could continue with decades, centuries, etc if your ranges can be that long)
Unfortunately, our inconsistent time intervals mean that you end up with some cases that can have over a hundred intervals while others have at most 8 or 9. So you'll want to pick the size of your intervals such that you don't have more than 10-15 intervals at most (or fewer than 5, for that matter). Also, you could break from a strict definition of 2 times the next biggest interval if you think it's easy to keep track of. For instance, you could use hours up to 3 days (72 hours) and weeks up to 4 months. A little trial and error might be necessary.
So to go back over, choose the interval type based on the size of your range, then choose the interval size by picking one of the "nice" numbers that will leave you with between 5 and about 15 tick marks. Or if you know and/or can control the actual number of pixels between tick marks you could put upper and lower bounds on how many pixels are acceptable between ticks (if they are spaced too far apart the graph may be hard to read, but if there are too many ticks the graph will be cluttered and your labels may overlap).
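To tie the two steps together, here is a hedged Python sketch: the crossover thresholds and per-unit step lists are taken from this answer, the month and year lengths are rough averages, and the constant and function names are my own.

# Pick the unit from the span of the visible range, then pick a "nice" step
# for that unit that keeps the tick count at or below max_ticks.
NICE_STEPS = {                      # (seconds per unit, allowed step multiples)
    "second": (1,        [1, 2, 3, 5, 10, 15, 30]),
    "minute": (60,       [1, 2, 3, 5, 10, 15, 30]),
    "hour":   (3600,     [1, 2, 3, 4, 6, 8, 12]),
    "day":    (86400,    [1, 2, 7]),
    "week":   (604800,   [1, 2, 4]),
    "month":  (2629800,  [1, 2, 3, 4, 6]),        # ~30.44 days
    "year":   (31557600, [1, 2, 5, 10, 20, 50]),  # 1, 2, 5 and power-of-10 multiples
}
CROSSOVER = [("second", 120), ("minute", 7200), ("hour", 172800),
             ("day", 1209600), ("week", 5259600), ("month", 63115200)]

def choose_ticks(range_seconds, max_ticks=15):
    unit = next((u for u, limit in CROSSOVER if range_seconds < limit), "year")
    unit_seconds, steps = NICE_STEPS[unit]
    for step in steps:
        if range_seconds / (step * unit_seconds) <= max_ticks:
            return unit, step
    return unit, steps[-1]

print(choose_ticks(36 * 3600))   # ~36 hours -> ('hour', 3), i.e. a tick every 3 hours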
Have a look at
http://tools.netsa.cert.org/netsa-python/doc/index.html
It has a nice.py (python/netsa/data/nice.py) which I think is stand-alone, and should work fine.
Still no answer to this question... I'll throw my first idea in then! I assume you have the range of the visible axis.
This is probably how I would do it.
Rough pseudo:
// quantify range
rangeLength = endOfVisiblePart - startOfVisiblePart;

// qualify range resolution
if (rangeLength < "1.5 days") {
    resolution = "day";   // it can be a number, e.g.: ..., 3 for day, 4 for week, ...
} else if (rangeLength < "9 days") {
    resolution = "week";
} else if (rangeLength < "35 days") {
    resolution = "month";
} // you can expand this in both ways to get from nanoseconds to geological eras if you wish
After that, it should (depending on what you have easy access to) be quite easy to determine the value of each nice label tick. Depending on the 'resolution', you format it differently, e.g. MM/DD for "week", MM:SS for "minute", etc., just like you said.
[Edit - I expanded this a little more at http://www.acooke.org/cute/AutoScalin0.html ]
A naive extension of the "nice numbers" algorithm seems to work for base 12 and 60, which gives good intervals for hours and minutes. This is code I just hacked together:
from math import ceil, floor, log, log10   # imports the snippet relies on

LIM10 = (10, [(1.5, 1), (3, 2), (7, 5)], [1, 2, 5])
LIM12 = (12, [(1.5, 1), (3, 2), (8, 6)], [1, 2, 6])
LIM60 = (60, [(1.5, 1), (20, 15), (40, 30)], [1, 15, 40])

def heckbert_d(lo, hi, ntick=5, limits=None):
    '''
    Heckbert's "nice numbers" algorithm for graph ranges, from "Graphics Gems".
    '''
    if limits is None:
        limits = LIM10
    (base, rfs, fs) = limits
    def nicenum(x, round):
        step = base ** floor(log(x)/log(base))
        f = float(x) / step
        nf = base
        if round:
            for (a, b) in rfs:
                if f < a:
                    nf = b
                    break
        else:
            for a in fs:
                if f <= a:
                    nf = a
                    break
        return nf * step
    delta = nicenum(hi-lo, False)
    return nicenum(delta / (ntick-1), True)

def heckbert(lo, hi, ntick=5, limits=None):
    '''
    Heckbert's "nice numbers" algorithm for graph ranges, from "Graphics Gems".
    '''
    def _heckbert():
        d = heckbert_d(lo, hi, ntick=ntick, limits=limits)
        graphlo = floor(lo / d) * d
        graphhi = ceil(hi / d) * d
        fmt = '%' + '.%df' % max(-floor(log10(d)), 0)
        value = graphlo
        while value < graphhi + 0.5*d:
            yield fmt % value
            value += d
    return list(_heckbert())
So, for example, if you want to display seconds from 0 to 60,
>>> heckbert(0, 60, limits=LIM60)
['0', '15', '30', '45', '60']
or hours from 0 to 5:
>>> heckbert(0, 5, limits=LIM12)
['0', '2', '4', '6']
I'd suggest you grab the source code of gnuplot or RRDTool (or even Flot) and examine how they approach this problem. The general case is likely to be N labels applied based on the width of your plot, with some kind of 'snapping' to the nearest 'nice' number.
Every time I've written such an algorithm (too many times really), I've used a table of 'preferences': based on the time range of the plot, decide whether I'm using weeks, days, hours, minutes, etc. as the main axis unit. I usually included some preferred formatting, as I rarely want to see the date for each minute I plot on the graph.
I'd be happy but surprised to find someone using a formula (like Heckbert does) to find 'nice', as the variation in time units between minutes, hours, days, and weeks is not that linear.
In theory you can also change the concept: instead of putting your data at the center of the visualization, put the scale at the center.
When you know the start and end dates of your data, you can create a scale covering all dates and place your data on that scale, like a fixed scale.
You can have scales of type year, month, day, hours, ... and limit the zooming to just these scales, which means giving up free (continuous) scaling.
The advantage is that gaps in the dates are easy to show; but if you have a lot of gaps, that can also become useless.
