Recursion in DAX

I don't know if this is even possible, but I'd like to be able to create a calculated column where each row is dependent on the rows above it.
A classic example of this is the Fibonacci sequence, where the sequence is defined by the recurrence relationship F(n) = F(n-1) + F(n-2) and seeds F(1) = F(2) = 1.
In table form,
Index  Fibonacci
-----  ---------
    1          1
    2          1
    3          2
    4          3
    5          5
    6          8
    7         13
    8         21
    9         34
   10         55
  ...        ...
I want to be able to construct the Fibonacci column as a calculated column.
Now, I know that the Fibonacci sequence has a nice closed form where I can define
Fibonacci = (((1 + SQRT(5))/2)^[Index] - ((1 - SQRT(5))/2)^[Index])/SQRT(5)
or using the shallow diagonals of Pascal's triangle form:
Fibonacci =
SUMX (
    ADDCOLUMNS (
        SELECTCOLUMNS (
            GENERATESERIES ( 0, FLOOR ( ( [Index] - 1 ) / 2, 1 ) ),
            "ID", [Value]
        ),
        "BinomCoeff", IF (
            [ID] = 0,
            1,
            PRODUCTX (
                GENERATESERIES ( 1, [ID] ),
                DIVIDE ( [Index] - [ID] - [Value], [Value] )
            )
        )
    ),
    [BinomCoeff]
)
but this is not the case for recursively defined functions in general (or for the purposes I'm actually interested in using this for).
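As a quick sanity check outside DAX (plain Python, nothing DAX-specific), here is a minimal sketch confirming that both the closed form and the shallow-diagonal sum above do reproduce the recurrence for the first ten terms:

from math import sqrt, comb

def fib_recurrence(n):
    # F(1) = F(2) = 1, F(n) = F(n-1) + F(n-2)
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def fib_closed_form(n):
    # Binet's formula, mirroring the first calculated column above
    return round((((1 + sqrt(5)) / 2) ** n - ((1 - sqrt(5)) / 2) ** n) / sqrt(5))

def fib_pascal_diagonal(n):
    # Sum of binomial coefficients along a shallow diagonal of Pascal's triangle
    return sum(comb(n - 1 - k, k) for k in range((n - 1) // 2 + 1))

for n in range(1, 11):
    assert fib_recurrence(n) == fib_closed_form(n) == fib_pascal_diagonal(n)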
In Excel, this is easy to do. You would write a formula like this
A3 = A2 + A1
or in R1C1 notation,
= R[-1]C + R[-2]C
but I just can't figure out if this is even possible in DAX.
Everything I've tried either doesn't work or gives a circular dependency error. For example,
Fibonacci =
VAR n = [Index]
RETURN
    IF (
        Table1[Index] <= 2,
        1,
        SUMX (
            FILTER ( Table1, Table1[Index] IN { n - 1, n - 2 } ),
            Table1[Fibonacci]
        )
    )
gives the error message
A circular dependency was detected: Table1[Fibonacci].
Edit:
In the book Tabular Modeling in Microsoft SQL Server Analysis Services by Marco Russo and Alberto Ferrari, DAX is described and includes this paragraph:
As a pure functional language, DAX does not have imperative statements, but it leverages special functions called iterators that execute a certain expression for each row of a given table expression. These arguments are close to the lambda expression in functional languages. However, there are limitations in the way you can combine them, so we cannot say they correspond to a generic lambda expression definition. Despite its functional nature, DAX does not allow you to define new functions and does not provide recursion.
It appears there is no straightforward way to do recursion. I still wonder, though, if there is a way to do it indirectly using the Parent-Child functions, which appear to be recursive in nature.
Edit 2:
While general recursion doesn't seem feasible, don't forget that recursive formulas may have a nice closed form that can be fairly easily derived.
Here are a couple of examples where I use this workaround to sidestep recursive formulas:
How to perform sum of previous cells of same column in PowerBI
DAX - formula referencing itself

Based on your first sample dataset, it looks to me like a "sort of" cumulative total, which can probably be calculated easily in SQL using a window function -- I tried a couple of things but nothing panned out just yet. I don't work with DAX enough to say whether it can be done.
Edit: On reviewing the Fibonacci sequence a little more closely, it turns out that my SQL code doing a cumulative comparison is not correct. You can read the SO post How to generate Fibonacci Series, which has a few good SQL Fibonacci answers that I tested; in particular the one by N J - answered Feb 13 '14. I'm not sure whether DAX has any Fibonacci recursion capability.
SQL Code (not quite correct):
DECLARE @myTable as table (Indx int)

INSERT INTO @myTable VALUES
(1),(2),(3),(4),(5),(6),(7),(8),(9),(10)

SELECT
    Indx
    ,SUM(myTable.Indx) OVER(ORDER BY myTable.Indx ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- + myTable.Indx
     AS [Cumulative]
    ,SUM(myTable.Indx) OVER(ORDER BY myTable.Indx ASC ROWS BETWEEN UNBOUNDED PRECEDING AND 2 PRECEDING)
     + SUM(myTable.Indx) OVER(ORDER BY myTable.Indx ASC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
     AS [Fibonacci]
FROM @myTable myTable
Result Set:
+------+-------------+-----------+
| Indx | Cumulative  | Fibonacci |
+------+-------------+-----------+
|    1 |           1 |      NULL |
|    2 |           3 |      NULL |
|    3 |           6 |         4 |
|    4 |          10 |         9 |
|    5 |          15 |        16 |
|    6 |          21 |        25 |
|    7 |          28 |        36 |
|    8 |          36 |        49 |
|    9 |          45 |        64 |
|   10 |          55 |        81 |
+------+-------------+-----------+
DAX Cumulative:
A link that could help with calculating cumulative totals in DAX: https://www.daxpatterns.com/cumulative-total/. Here is some sample code from the article.
Cumulative Quantity :=
CALCULATE (
    SUM ( Transactions[Quantity] ),
    FILTER (
        ALL ( 'Date'[Date] ),
        'Date'[Date] <= MAX ( 'Date'[Date] )
    )
)
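For comparison outside DAX, the same running-total idea is a one-liner; a minimal pandas sketch, assuming a table with Date and Quantity columns like the Transactions table above:

import pandas as pd

# Hypothetical data standing in for Transactions[Date] / Transactions[Quantity].
transactions = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "Quantity": [5, 3, 7],
})

# Cumulative quantity up to and including each date.
transactions = transactions.sort_values("Date")
transactions["CumulativeQuantity"] = transactions["Quantity"].cumsum()
print(transactions)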

The DAX language doesn't support recursion.
This is also mentioned in an SQLBI article about calculation groups:
DAX is not recursive, so Calculation Groups do not allow recursion. This is a good idea for controlling performance, but it requires a different approach compared to certain techniques that are possible in MDX Script by leveraging recursion.
https://www.sqlbi.com/blog/marco/2019/03/01/calculation-groups-in-dax-first-impressions/

Related

DAX for populating column value based on previous row of same column

I am trying to replicate this Excel formula or for loop that calculates the current row value based on the previous row value of the same column.
When value = 1, Result = Rank * 1
Else, Result = Rank * previous Result
| Value | Rank | Result |
| ----- | ---- | ------ |
| 1 | 3 | 3 |
| 2 | 2 | 6 |
| 3 | 1 | 6 |
I already tried generating a series table to get the Value and Rank columns, but I am unable to refer to the existing Result column to update it. Is there a way to get this done using DAX for a dynamic number of rows?
It can be done, but you need to get your mindset out of recursion. Notice that I return [Rank] even when [Value] equals 1, since [Rank] * [Value] in that case is just [Rank] (the value equals 1, right?).
Result =
IF (
    Table[Value] = 1,
    [Rank],
    [Rank]
        * PRODUCTX (
            FILTER ( Table, Table[Value] < EARLIER ( Table[Value] ) ),
            [Rank]
        )
)
EDIT:
The previous one was unnecessarily complex:
Result =
CALCULATE (
    PRODUCT ( [Rank] ),
    FILTER ( Table, [Value] <= EARLIER ( [Value] ) )
)
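Put differently, the "previous Result" is just a running product of Rank ordered by Value, which is what the CALCULATE / PRODUCT / FILTER pattern computes. A small pandas sketch of the same reformulation on the sample table:

import pandas as pd

df = pd.DataFrame({"Value": [1, 2, 3], "Rank": [3, 2, 1]})

# Running product of Rank over increasing Value gives 3, 6, 6,
# matching the expected Result column.
df = df.sort_values("Value")
df["Result"] = df["Rank"].cumprod()
print(df)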

How to decide the probability percentage in question

I have the below question:
In the first part of the question, it says the probability that the selected person will be a male is 0.44, which means the number of males is 25*0.44 = 11. That's OK.
In the second part, the probability that the selected person will be a male who was born before 1960 is 0.28. Does that mean 0.28 out of the total number, which is 25, or out of the number of males?
I mean, should the number of males who were born before 1960 equal 25*0.28 or 11*0.28?
I find it easiest to think of these sorts of problems as contingency tables.
You use a matrix layout to express the distributions in terms of two or more factors or characteristics, each having two or more categories. The table can be constructed either with probabilities (proportions) or with counts, and switching back and forth is easy based on the total count in the table. Entries in the table are the intersections of the categories, corresponding to "and" in a verbal description. The numbers to the right or at the bottom of the table are called marginals, because they're found in the margins of the tables, and are always the sum of the table row or column entries in which they occur. The total probability (or count) in the table is found by summing across all the rows and columns. The marginal distribution of gender would be found by summing across rows, and the marginal distribution of birthdays would be found by summing across the columns.
Based on this, you can inferentially determine other values as indicated by the entries in parentheses below. With one more entry, either for gender or in the marginal row for birthdays, you'd be able to fill in the whole table inferentially. (This is related to the concept of degrees of freedom - how many pieces of info can you fill in independently before the others are determined by the known constraint that the totals are fixed or that probability adds to 1.)
Probabilities

                      Birthday
                < 1960  |  >= 1960
            __________________________
            |           |            |
Gender   F  |           |            |  (0.56)
            |___________|____________|
            |           |            |
         M  |   0.28    |   (0.16)   |   0.44
            |___________|____________|________
                  ?            ?         1.00

Counts

                      Birthday
                < 1960  |  >= 1960
            __________________________
            |           |            |
Gender   F  |           |            |   (14)
            |___________|____________|
            |           |            |
         M  |     7     |    (4)     |    11
            |___________|____________|________
                  ?            ?          25
Conditional probability corresponds to limiting yourself to the subset of rows or columns specified in the condition. If you had been asked what is the probability of a birthday < 1960 given the gender is male, i.e., P{birthday < 1960 | M} in relatively standard notation, you'd be restricting your focus to just the M row, so the answer would be 7/11 = 0.28/0.44. Computationally, you take the probabilities or counts in the qualifying table entries and express them as a proportion of the probabilities or counts of the specified (given) marginal entries. This is often written in prob & stats texts as P(A|B) = P(AB)/P(B), where AB is a set shorthand for A and B (intersection).
0.44 = 11/25 of the people are male.
0.28 = 7/25 of the people are male & born before 1960.
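The same arithmetic, spelled out in a few lines of Python:

total = 25
males = round(total * 0.44)               # 11
males_pre_1960 = round(total * 0.28)      # 7  (0.28 is out of the total of 25)
males_post_1960 = males - males_pre_1960  # 4
females = total - males                   # 14

# Conditional probability P(born < 1960 | male) = 7/11 = 0.28/0.44
p_pre_1960_given_male = males_pre_1960 / males
print(males, males_pre_1960, males_post_1960, females, round(p_pre_1960_given_male, 2))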

How do I get one row for every Min or Max on every column of a dataframe in Pyspark efficiently?

I'm trying to reduce a big dataset to rows having minimum and maximum values for each column. In other words, I would like, for every column of this dataset to get one row that has the minimum value on that column, as well as another that has the maximum value on the same column. I should mention that I do not know in advance what columns this dataset will have. Here's an example:
+----+----+----+      +----+----+----+
|Col1|Col2|Col3|  ==> |Col1|Col2|Col3|
+----+----+----+      +----+----+----+
| F  | 99 | 17 |      | A  | 34 | 25 |
| M  | 32 | 20 |      | Z  | 51 | 49 |
| D  |  2 | 84 |      | D  |  2 | 84 |
| H  | 67 | 90 |      | F  | 99 | 17 |
| P  | 54 | 75 |      | C  | 18 |  9 |
| C  | 18 |  9 |      | H  | 67 | 90 |
| Z  | 51 | 49 |      +----+----+----+
| A  | 34 | 25 |
+----+----+----+
The first row is selected because A is the smallest value on Col1. The second because Z is the largest value on Col1. The third because 2 is the smallest on Col2, and so on. The code below seems to do the right thing (correct me if I'm wrong), but performance is sloooow. I start with getting a dataframe from a random .csv file:
input_file = (sqlContext.read
    .format("csv")
    .options(header="true", inferSchema="true", delimiter=";", charset="UTF-8")
    .load("/FileStore/tables/random.csv")
)
Then I create two other dataframes that each have one row with the min and respectively, max values of each column:
from pyspark.sql.functions import col, min, max
min_values = input_file.select(
    *[min(col(col_name)).name(col_name) for col_name in input_file.columns]
)
max_values = input_file.select(
    *[max(col(col_name)).name(col_name) for col_name in input_file.columns]
)
Finally, I repeatedly join the original input file to these two dataframes holding minimum and maximum values, using every column in turn, and do a union between all the results.
min_max_rows = (
    input_file
    .join(min_values, input_file[input_file.columns[0]] == min_values[input_file.columns[0]])
    .select(input_file["*"]).limit(1)
    .union(
        input_file
        .join(max_values, input_file[input_file.columns[0]] == max_values[input_file.columns[0]])
        .select(input_file["*"]).limit(1)
    )
)

for c in input_file.columns[1:]:
    min_max_rows = min_max_rows.union(
        input_file
        .join(min_values, input_file[c] == min_values[c])
        .select(input_file["*"]).limit(1)
        .union(
            input_file
            .join(max_values, input_file[c] == max_values[c])
            .select(input_file["*"]).limit(1)
        )
    )

# dropDuplicates returns a new DataFrame, so keep the result
min_max_rows = min_max_rows.dropDuplicates()
For my test dataset of 500k rows, 40 columns, doing all this takes about 7-8 minutes on a standard Databricks cluster. I'm supposed to sift through more than 20 times this amount of data regularly. Is there any way to optimize this code? I'm quite afraid I've taken the naive approach to it, since I'm quite new to Spark.
Thanks!
Does not seem to be a popular question, but an interesting one (for me). And a lot of work for 15 pts. In fact, I got it wrong the first time round.
Here is a scalable solution that you can partition accordingly to increase throughput.
It's hard to explain; manipulating and transposing the data is the key issue here - and some lateral thinking.
I did not focus on variable columns with all sorts of data types. That needs to be solved by yourself; it can be done, but some if/else logic is required to check whether values are alpha, double, or numeric. Mixing data types and applying operations to them gets problematic, but it can be solved. I gave a notion of num_string, but did not complete that.
I have focused on the scalability issue and the approach, with less procedural logic. It uses a smaller sample size with all numbers, but it is correct now as far as I can see. The general principle is there.
Try it. Success.
Code:
from pyspark.sql.functions import *
from pyspark.sql.types import *

def reshape(df, by):
    # Transpose the value columns into key/val rows, keeping the "by" columns.
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

df1 = spark.createDataFrame(
    [(4, 15, 3), (200, 100, 25), (7, 16, 4)], ("c1", "c2", "c3"))
df1 = df1.withColumn("rowId", monotonically_increasing_id())
df1.cache()
df1.show()

df2 = reshape(df1, ["rowId"])
df2.show()

# In case you have other types like characters in the other column - not focusing on that aspect
df3 = df2.withColumn("num_string", format_string("%09d", col("val")))
# Avoid column name issues.
df3 = df3.withColumn("key1", col("key"))
df3.show()

# Min and max per original column (key1).
df3 = df3.groupby('key1').agg(min(col("val")).alias("min_val"), max(col("val")).alias("max_val"))
df3.show()

df4 = df3.join(df2, df3.key1 == df2.key)

# Flag each value as a min (-1), max (1) or neither (0) within its original column.
new_column_condition = expr(
    """IF(val = min_val, -1, IF(val = max_val, 1, 0))"""
)
df4 = df4.withColumn("col_class", new_column_condition)
df4.show()

df5 = df4.filter( '(min_val = val or max_val = val) and col_class <> 0' )
df5.show()

# Join back to the original rows via rowId and keep the original columns only.
df6 = df5.join(df1, df5.rowId == df1.rowId)
df6.show()

df6.select([c for c in df6.columns if c in ['c1','c2', 'c3']]).distinct().show()
Returns:
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  4| 15|  3|
|200|100| 25|
+---+---+---+
Data wrangling is the clue here.
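For what it's worth, here is a more compact sketch of a different route to the same output (not the transpose approach above): compute every column's min and max in a single aggregation, then keep any row matching one of them. Note that ties will return every qualifying row rather than exactly one per extreme:

from functools import reduce
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("F", 99, 17), ("M", 32, 20), ("D", 2, 84), ("H", 67, 90),
     ("P", 54, 75), ("C", 18, 9), ("Z", 51, 49), ("A", 34, 25)],
    ["Col1", "Col2", "Col3"])

# One aggregation pass collects the min and max of every column onto the driver.
stats = df.agg(*[F.min(c).alias("min_" + c) for c in df.columns],
               *[F.max(c).alias("max_" + c) for c in df.columns]).first()

# Keep any row holding a min or a max in at least one column.
cond = reduce(lambda a, b: a | b,
              [(F.col(c) == stats["min_" + c]) | (F.col(c) == stats["max_" + c])
               for c in df.columns])
df.filter(cond).dropDuplicates().show()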

Nested cos() calculation in Oracle 10

I have a table with some positive integer numbers:
n
----
1
2
5
10
For each row of this table I want the value cos(cos(...cos(0)...)) (cos applied n times) to be calculated by means of an SQL statement (PL/SQL stored procedures and functions are not allowed):
  n  coscos
---  --------------
  1  1
  2  0.540302305868
  5  0.793480358743
 10  0.731404042423
I can do this in Oracle 11g by using recursive queries.
Is it possible to do the same in Oracle 10g ?
The MODEL clause can solve this:
Test data:
create table test1(n number unique);
insert into test1 select * from table(sys.odcinumberlist(1,2,5,10));
commit;
Query:
--The last row for each N has the final coscos value.
select n, coscos
from
(
    --Set each value to the cos() of the previous value.
    select * from
    (
        --Each value of N has N rows, with value rownumber from 1 to N.
        select n, rownumber
        from
        (
            --Maximum number of rows needed (the largest number in the table)
            select level rownumber
            from dual
            connect by level <= (select max(n) from test1)
        ) max_rows
        cross join test1
        where max_rows.rownumber <= test1.n
        order by n, rownumber
    ) n_to_rows
    model
        partition by (n)
        dimension by (rownumber)
        measures (0 as coscos)
    (
        coscos[1] = cos(0),
        coscos[rownumber > 1] = cos(coscos[cv(rownumber)-1])
    )
)
where n = rownumber
order by n;
Results:
  N  COSCOS
  1  1
  2  0.54030230586814
  5  0.793480358742566
 10  0.73140404242251
Let the holy wars begin:
Is this query a good idea? I wouldn't run this query in production, but hopefully it is a useful demonstration that any problem can be solved with SQL.
I've seen literally thousands of hours wasted because people are afraid to use SQL. If you're heavily using a database it is foolish to not use SQL as your primary programming language. It's good to occasionally spend a few hours to test the limits of SQL. A few strange queries is a small price to pay to avoid the disastrous row-by-row processing mindset that infects many database programmers.
Using WITH FUNCTION(Oracle 12c):
WITH FUNCTION coscos(n INT) RETURN NUMBER IS
  BEGIN
    IF n > 1
      THEN RETURN cos(coscos(n-1));
      ELSE RETURN cos(0);
    END IF;
  END;
SELECT n, coscos(n)
FROM t;
db<>fiddle demo
Output:
+-----+--------------------------------------------+
|  N  |                 COSCOS(N)                  |
+-----+--------------------------------------------+
|   1 | 1                                          |
|   2 | .5403023058681397174009366074429766037354  |
|   5 | .793480358742565591826054230990284002387   |
|  10 | .7314040424225098582924268769524825209688  |
+-----+--------------------------------------------+
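As a quick cross-check of those values outside the database, the same nested cosine is a short loop in Python:

from math import cos

def coscos(n):
    # Apply cos() n times, starting from 0.
    v = 0.0
    for _ in range(n):
        v = cos(v)
    return v

for n in (1, 2, 5, 10):
    print(n, coscos(n))  # 1.0, 0.5403..., 0.7934..., 0.7314...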

Faster computation to get multiples of number at different levels

Here is the scenario:
We have several items that are shipped to many stores. We want to be able to allocate a certain quantity of each item to a store based on need. Each of these stores is also associated to a specific warehouse.
The catch is that at the warehouse level, the total quantity of each item must be a multiple of a number (6 for example).
I have already calculated out the quantity needed by each store at store level, but they do not sum up to a multiple of 6 at the warehouse level.
My solution was this using Excel:
I use a SUMIFS formula to keep track of the sum of each item allocated at the warehouse level, then another MOD(6) formula that calculates the amount remaining until a multiple of 6. Then my VBA code loops through and subtracts 1 (if MOD <= 3) from, or adds 1 (if MOD > 3) to, the store-level units needed until MOD = 0 for all rows.
Now this works for me, but it is extremely slow even when I have just ~5000 rows.
I am looking for a faster solution, because every time I subtract from or add to the units needed, the SUMIFS and MOD need to be calculated again.
EDIT: (trying to be clearer)
I have a template file that I paste my data into with the following setup:
+------+-------+-----------+----------+--------------+--------+
| Item | Store | Warehouse | StoreQty | WarehouseQty | Mod(6) |
+------+-------+-----------+----------+--------------+--------+
|    1 |     1 |         1 |        2 |            8 |      2 |
|    1 |     2 |         1 |        3 |            8 |      2 |
|    1 |     3 |         1 |        1 |            8 |      2 |
|    1 |     4 |         1 |        2 |            8 |      2 |
|    2 |     1 |         2 |        1 |            4 |      2 |
|    2 |     2 |         2 |        3 |            4 |      2 |
+------+-------+-----------+----------+--------------+--------+
Currently the WarehouseQty column is a SUMIFS formula summing up the StoreQty for each Item-Store combo associated with the Warehouse. So I guess the Warehouse/WarehouseQty columns are actually duplicated for every Item-Store combo that shows up. The WarehouseQty is the one that needs to be a multiple of 6.
Screen updating can be turned OFF to speed up lengthy computations like this:
Application.ScreenUpdating = FALSE
The opposite assignment turns screen updating back on again.
Put the data into an array first, rather than working on cells, then put the data back after you have manipulated it - this will be much faster.
An example which uses your criteria:
Option Explicit

Sub test()
    Dim q() 'this is what will be used for the range
    Dim i As Long
    q = Range("C2:C41") 'put the data into the array - *ALWAYS* 2 dimensions, even if a single column
    For i = LBound(q) To UBound(q) ' use this, in case it's a dynamic array - 1 to 40 would have worked here
        Select Case q(i, 1) Mod 6 ' calculate remainder
            Case 0 To 3
                q(i, 1) = q(i, 1) - (q(i, 1) Mod 6) 'make a multiple of 6
            Case 4 To 5
                q(i, 1) = q(i, 1) - (q(i, 1) Mod 6) + 6 ' and go higher in the later numbers
        End Select
    Next i
    Range("D2:D41") = q ' drop the data back
End Sub
I'm guessing you may find that stopping the screen refresh helps quite a chunk, and therefore you won't need any more suggestions.
Another option would be to reduce your adjustment to a quantity divisible by 6 to a handful of IF statements, depending on the value of MOD(6).
You could also address how you sum up the number of a particular item across all stores: using a pivot table and reading the sum totals from there is a lot quicker than using SUMIFS in a macro.
Based on your modifications to the question:
You're correct that you could have huge amounts of replication doing the calculation row by row, as well as adjusting the quantity by a single unit at a time even though you know exactly how many units you need to add / remove from the mod(6) formula.
Could you not create a new sheet with all your possible combinations of product ID and store? You could then use SUMIFS() for each of these unique combinations and, in a final step, round up/down at the warehouse level?
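To illustrate that aggregate-then-round idea outside Excel, here is a small pandas sketch, assuming columns named as in the sample table above (the step of pushing the adjustment back down to individual stores is left out):

import pandas as pd

df = pd.DataFrame({
    "Item":      [1, 1, 1, 1, 2, 2],
    "Store":     [1, 2, 3, 4, 1, 2],
    "Warehouse": [1, 1, 1, 1, 2, 2],
    "StoreQty":  [2, 3, 1, 2, 1, 3],
})

# Sum once per Item/Warehouse combination (the SUMIFS step)...
wh = (df.groupby(["Item", "Warehouse"], as_index=False)["StoreQty"]
        .sum()
        .rename(columns={"StoreQty": "WarehouseQty"}))

# ...then round each warehouse total to a multiple of 6 in one pass:
# down when the remainder is 3 or less, up otherwise.
r = wh["WarehouseQty"] % 6
wh["TargetQty"] = wh["WarehouseQty"] - r + (r > 3) * 6
print(wh)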
