DAX for populating column value based on previous row of same column - for-loop

I am trying to replicate an Excel formula (or a for loop) that calculates the current row's value based on the previous row's value in the same column.
When Value = 1, Result = Rank * 1
Otherwise, Result = Rank * previous Result
| Value | Rank | Result |
| ----- | ---- | ------ |
| 1 | 3 | 3 |
| 2 | 2 | 6 |
| 3 | 1 | 6 |
I have already tried generating a series table to get the Value and Rank columns, but I am unable to refer to the existing Result column to update it. Is there a way to do this in DAX for a dynamic number of rows?

It can be done, but you need to get out of the recursion mindset. Notice that I return [Rank] even when [Value] equals 1, since [Rank] * [Value] equals [Rank] in that case (the value is 1, right?).
Result =
IF (
    Table[Value] = 1,
    [Rank],
    [Rank]
        * PRODUCTX (
            FILTER ( Table, Table[Value] < EARLIER ( Table[Value] ) ),
            [Rank]
        )
)
EDIT:
The previous one was unnecessarily complex:
Result =
CALCULATE (
    PRODUCT ( [Rank] ),
    FILTER ( Table, [Value] <= EARLIER ( [Value] ) )
)
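To see why this works without recursion: the rule Result = Rank * previous Result (with Result = Rank on the first row) unrolls into a running product of Rank over all rows up to the current one, which is exactly what the filtered PRODUCT computes. A quick illustration of that equivalence in Python (not DAX, purely to show the reasoning):
# The recursive rule vs. the running product it unrolls into.
ranks = [3, 2, 1]                    # the Rank column from the sample table

# Recursive definition: Result(1) = Rank(1), Result(n) = Rank(n) * Result(n-1)
recursive = []
for r in ranks:
    recursive.append(r if not recursive else r * recursive[-1])

# Non-recursive equivalent: running product of Rank up to the current row,
# i.e. what PRODUCT over FILTER ( [Value] <= EARLIER ( [Value] ) ) computes.
running = []
prod = 1
for r in ranks:
    prod *= r
    running.append(prod)

assert recursive == running == [3, 6, 6]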

Related

How do I get one row for every Min or Max on every column of a dataframe in Pyspark efficiently?

I'm trying to reduce a big dataset to rows having minimum and maximum values for each column. In other words, for every column of this dataset, I would like to get one row that has the minimum value in that column, as well as another row that has the maximum value in the same column. I should mention that I do not know in advance what columns this dataset will have. Here's an example:
+----+----+----+     +----+----+----+
|Col1|Col2|Col3| ==> |Col1|Col2|Col3|
+----+----+----+     +----+----+----+
| F  | 99 | 17 |     | A  | 34 | 25 |
| M  | 32 | 20 |     | Z  | 51 | 49 |
| D  | 2  | 84 |     | D  | 2  | 84 |
| H  | 67 | 90 |     | F  | 99 | 17 |
| P  | 54 | 75 |     | C  | 18 | 9  |
| C  | 18 | 9  |     | H  | 67 | 90 |
| Z  | 51 | 49 |     +----+----+----+
| A  | 34 | 25 |
+----+----+----+
The first row is selected because A is the smallest value in Col1. The second because Z is the largest value in Col1. The third because 2 is the smallest in Col2, and so on. The code below seems to do the right thing (correct me if I'm wrong), but performance is very slow. I start by reading a dataframe from a random .csv file:
input_file = (sqlContext.read
    .format("csv")
    .options(header="true", inferSchema="true", delimiter=";", charset="UTF-8")
    .load("/FileStore/tables/random.csv")
)
Then I create two other dataframes that each have one row with the min and respectively, max values of each column:
from pyspark.sql.functions import col, min, max

min_values = input_file.select(
    *[min(col(col_name)).name(col_name) for col_name in input_file.columns]
)
max_values = input_file.select(
    *[max(col(col_name)).name(col_name) for col_name in input_file.columns]
)
Finally, I repeatedly join the original input file to these two dataframes holding minimum and maximum values, using every column in turn, and do a union between all the results.
min_max_rows = (
    input_file
    .join(min_values, input_file[input_file.columns[0]] == min_values[input_file.columns[0]])
    .select(input_file["*"]).limit(1)
    .union(
        input_file
        .join(max_values, input_file[input_file.columns[0]] == max_values[input_file.columns[0]])
        .select(input_file["*"]).limit(1)
    )
)

for c in input_file.columns[1:]:
    min_max_rows = min_max_rows.union(
        input_file
        .join(min_values, input_file[c] == min_values[c])
        .select(input_file["*"]).limit(1)
        .union(
            input_file
            .join(max_values, input_file[c] == max_values[c])
            .select(input_file["*"]).limit(1)
        )
    )

min_max_rows.dropDuplicates()
For my test dataset of 500k rows, 40 columns, doing all this takes about 7-8 minutes on a standard Databricks cluster. I'm supposed to sift through more than 20 times this amount of data regularly. Is there any way to optimize this code? I'm quite afraid I've taken the naive approach to it, since I'm quite new to Spark.
Thanks!
Not a popular question, it seems, but an interesting one (for me), and quite a lot of work for 15 points. In fact, I got it wrong the first time around.
Here is a scalable solution that you can partition accordingly to increase throughput.
It is hard to explain; manipulating and transposing the data is the key here, along with some lateral thinking.
I did not focus on variable columns with all sorts of data types. You will need to handle that yourself; it can be done, but some if/else logic is required to check whether a column is alphabetic, double, or numeric. Mixing data types and operating on them gets problematic, but it can be solved. I hinted at this with num_string, but did not complete it.
I have focused on the scalability issue and the approach, with less procedural logic. The sample is smaller and all-numeric, but the result is correct now as far as I can see. The general principle is there.
Try it. Success.
Code:
from pyspark.sql.functions import *
from pyspark.sql.types import *

def reshape(df, by):
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

df1 = spark.createDataFrame(
    [(4, 15, 3), (200, 100, 25), (7, 16, 4)], ("c1", "c2", "c3"))
df1 = df1.withColumn("rowId", monotonically_increasing_id())
df1.cache()
df1.show()

df2 = reshape(df1, ["rowId"])
df2.show()

# In case you have other types, like characters, in the other columns - not focusing on that aspect.
df3 = df2.withColumn("num_string", format_string("%09d", col("val")))
# Avoid column name issues.
df3 = df3.withColumn("key1", col("key"))
df3.show()

df3 = df3.groupby('key1').agg(min(col("val")).alias("min_val"), max(col("val")).alias("max_val"))
df3.show()

df4 = df3.join(df2, df3.key1 == df2.key)
new_column_condition = expr(
    """IF(val = min_val, -1, IF(val = max_val, 1, 0))"""
)
df4 = df4.withColumn("col_class", new_column_condition)
df4.show()

df5 = df4.filter('(min_val = val or max_val = val) and col_class <> 0')
df5.show()

df6 = df5.join(df1, df5.rowId == df1.rowId)
df6.show()

df6.select([c for c in df6.columns if c in ['c1', 'c2', 'c3']]).distinct().show()
Returns:
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  4| 15|  3|
|200|100| 25|
+---+---+---+
Data wrangling is the key here.
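For what it's worth, here is a minimal sketch of how the same reshape-and-aggregate idea could be wired up against the asker's input_file from the question. This is an assumption on my part: it reuses reshape() from the code above and supposes all columns are numeric and of a comparable type, as in the sample.
# Sketch only: the melt / aggregate / join-back pattern applied to input_file.
from pyspark.sql.functions import col, min, max, monotonically_increasing_id

wide = input_file.withColumn("rowId", monotonically_increasing_id()).cache()

# Melt to (rowId, key, val) and compute per-column min/max in one aggregation pass.
long = reshape(wide, ["rowId"])
extremes = long.groupBy("key").agg(min("val").alias("min_val"),
                                   max("val").alias("max_val"))

# Keep the melted rows that hit a per-column extreme, then recover the
# original wide rows and drop duplicates.
hits = (long.join(extremes, "key")
            .filter((col("val") == col("min_val")) | (col("val") == col("max_val")))
            .select("rowId")
            .distinct())

wide.join(hits, "rowId").drop("rowId").distinct().show()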

Recursion in DAX

I don't know if this is even possible, but I'd like to be able to create a calculated column where each row is dependent on the rows above it.
A classic example of this is the Fibonacci sequence, where the sequence is defined by the recurrence relationship F(n) = F(n-1) + F(n-2) and seeds F(1) = F(2) = 1.
In table form,
Index  Fibonacci
----------------
    1          1
    2          1
    3          2
    4          3
    5          5
    6          8
    7         13
    8         21
    9         34
   10         55
  ...        ...
I want to be able to construct the Fibonacci column as a calculated column.
Now, I know that the Fibonacci sequence has a nice closed form where I can define
Fibonacci = (((1 + SQRT(5))/2)^[Index] - ((1 - SQRT(5))/2)^[Index])/SQRT(5)
or using the shallow diagonals of Pascal's triangle form:
Fibonacci =
SUMX (
    ADDCOLUMNS (
        SELECTCOLUMNS (
            GENERATESERIES ( 0, FLOOR ( ( [Index] - 1 ) / 2, 1 ) ),
            "ID", [Value]
        ),
        "BinomCoeff", IF (
            [ID] = 0,
            1,
            PRODUCTX (
                GENERATESERIES ( 1, [ID] ),
                DIVIDE ( [Index] - [ID] - [Value], [Value] )
            )
        )
    ),
    [BinomCoeff]
)
but this is not the case for recursively defined functions in general (or for the purposes I'm actually interested in using this for).
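For what it's worth, that closed form is easy to sanity-check outside DAX. A small, purely illustrative Python check of Binet's formula against the recurrence:
# Binet's closed form vs. the recurrence F(n) = F(n-1) + F(n-2), F(1) = F(2) = 1.
from math import sqrt

def binet(n):
    phi = (1 + sqrt(5)) / 2
    psi = (1 - sqrt(5)) / 2
    return round((phi ** n - psi ** n) / sqrt(5))

fib = [1, 1]
for _ in range(8):
    fib.append(fib[-1] + fib[-2])

assert [binet(n) for n in range(1, 11)] == fib    # 1, 1, 2, 3, 5, 8, 13, 21, 34, 55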
In Excel, this is easy to do. You would write a formula like this
A3 = A2 + A1
or in R1C1 notation,
= R[-1]C + R[-2]C
but I just can't figure out if this is even possible in DAX.
Everything I've tried either doesn't work or gives a circular dependency error. For example,
Fibonacci =
VAR n = [Index]
RETURN
    IF (
        Table1[Index] <= 2,
        1,
        SUMX (
            FILTER ( Table1, Table1[Index] IN { n - 1, n - 2 } ),
            Table1[Fibonacci]
        )
    )
gives the error message
A circular dependency was detected: Table1[Fibonacci].
Edit:
In the book Tabular Modeling in Microsoft SQL Server Analysis Services by Marco Russo and Alberto Ferrari, DAX is described and includes this paragraph:
As a pure functional language, DAX does not have imperative statements, but it leverages special functions called iterators that execute a certain expression for each row of a given table expression. These arguments are close to the lambda expression in functional languages. However, there are limitations in the way you can combine them, so we cannot say they correspond to a generic lambda expression definition. Despite its functional nature, DAX does not allow you to define new functions and does not provide recursion.
It appears there is no straightforward way to do recursion. I do still wonder if there is a way to still do it indirectly somehow using Parent-Child functions, which appear to be recursive in nature.
Edit 2:
While general recursion doesn't seem feasible, don't forget that recursive formulas may have a nice closed form that can be fairly easily derived.
Here are a couple of examples where I use this workaround to sidestep recursive formulas:
How to perform sum of previous cells of same column in PowerBI
DAX - formula referencing itself
Based on your first sample dataset, it looks to me like a "sort of" cumulative total, which can probably be calculated easily in SQL using a window function -- I tried a couple of things, but nothing has panned out just yet. I don't work with DAX enough to say whether it can be done there.
Edit: On reviewing the Fibonacci sequence a little more closely, it turns out that my SQL code doing the cumulative comparison is not correct. You can read the SO post How to generate Fibonacci Series, which has a few good SQL Fibonacci answers that I tested; in particular, the post by N J, answered Feb 13 '14. I'm not sure whether DAX has any Fibonacci recursion capability.
SQL Code (not quite correct):
DECLARE @myTable as table (Indx int)
INSERT INTO @myTable VALUES
    (1),(2),(3),(4),(5),(6),(7),(8),(9),(10)

SELECT
    Indx
    ,SUM(myTable.Indx) OVER(ORDER BY myTable.Indx ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- + myTable.Indx
        AS [Cummulative]
    ,SUM(myTable.Indx) OVER(ORDER BY myTable.Indx ASC ROWS BETWEEN UNBOUNDED PRECEDING AND 2 PRECEDING)
        + SUM(myTable.Indx) OVER(ORDER BY myTable.Indx ASC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
        AS [Fibonacci]
FROM @myTable myTable
Result Set:
+------+-------------+-----------+
| Indx | Cummulative | Fibonacci |
+------+-------------+-----------+
|    1 |           1 |      NULL |
+------+-------------+-----------+
|    2 |           3 |      NULL |
+------+-------------+-----------+
|    3 |           6 |         4 |
+------+-------------+-----------+
|    4 |          10 |         9 |
+------+-------------+-----------+
|    5 |          15 |        16 |
+------+-------------+-----------+
|    6 |          21 |        25 |
+------+-------------+-----------+
|    7 |          28 |        36 |
+------+-------------+-----------+
|    8 |          36 |        49 |
+------+-------------+-----------+
|    9 |          45 |        64 |
+------+-------------+-----------+
|   10 |          55 |        81 |
+------+-------------+-----------+
DAX Cumulative:
A link that could help with calculating cumulative totals in DAX: https://www.daxpatterns.com/cumulative-total/. Here is some sample code from the article.
Cumulative Quantity :=
CALCULATE (
    SUM ( Transactions[Quantity] ),
    FILTER (
        ALL ( 'Date'[Date] ),
        'Date'[Date] <= MAX ( 'Date'[Date] )
    )
)
The DAX language doesn't support recursion.
This is also mentioned in SQLBI's article about calculation groups:
DAX is not recursive, so Calculation Groups do not allow recursion. This is a good idea for controlling performance, but it requires a different approach compared to certain techniques that are possible in MDX Script by leveraging recursion.
https://www.sqlbi.com/blog/marco/2019/03/01/calculation-groups-in-dax-first-impressions/

Oracle add group function over result rows

I'm trying to add an aggregate function column to an existing result set. I've tried variations of OVER(), UNION, but cannot find a solution.
Example current result set:
ID  ATTR   VALUE
1   score  5
1   score  7
1   score  9
Example desired result set:
ID  ATTR   VALUE  STDDEV (score)
1   score  5      2
1   score  7      2
1   score  9      2
Thank you
Seems like you're after:
stddev(value) over (partition by attr)
stddev(value) over (partition by id, attr)
It just depends on what you need to partition by. Based on the sample data, attr should be enough, but I could see partitioning by both ID and attr.
Example:
With CTE (ID, Attr, Value) as (
    SELECT 1, 'score', 5 from dual union all
    SELECT 1, 'score', 7 from dual union all
    SELECT 1, 'score', 9 from dual union all
    SELECT 1, 'Z', 1 from dual union all
    SELECT 1, 'Z', 5 from dual union all
    SELECT 1, 'Z', 8 from dual)
SELECT A.*, stddev(value) over (partition by attr) as stdev
FROM cte A
ORDER BY attr, value
The docs show that by adding an ORDER BY to the analytic function, one can get the cumulative standard deviation per record.
Giving us:
+----+-------+-------+------------------------------------------+
| ID | attr  | value | stdev                                    |
+----+-------+-------+------------------------------------------+
|  1 | Z     |     1 | 3.51188458428424628280046822063322249225 |
|  1 | Z     |     5 | 3.51188458428424628280046822063322249225 |
|  1 | Z     |     8 | 3.51188458428424628280046822063322249225 |
|  1 | score |     5 | 2                                        |
|  1 | score |     7 | 2                                        |
|  1 | score |     9 | 2                                        |
+----+-------+-------+------------------------------------------+
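As a cross-check on the value 2 shown for the score rows: Oracle's STDDEV is the sample standard deviation, and for (5, 7, 9) that is sqrt((4 + 0 + 4) / 2) = 2. The same check in Python, purely illustrative:
# Sample standard deviation of the 'score' values, matching Oracle's STDDEV.
from statistics import stdev
assert stdev([5, 7, 9]) == 2.0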

Oracle 8g - How can I create a dynamic table without pivot?

I need to make a dynamic table in Oracle 8g, but this version doesn't have the PIVOT feature. I want to turn this table:
Date       | Code | count
12/04/2016 | a1   | 8
12/05/2016 | a2   | 10
10/06/2016 | a3   | 4
24/10/2016 | a2   | 6
into a table like this:
Date       | a1 | a2 | a3
12/04/2016 | 8  |    |
12/05/2016 |    | 10 |
10/06/2016 |    |    | 4
24/10/2016 |    | 6  |
The number of codes is not known in advance, which is why I can't create a static table.
Use a "plain" pivot query:
SELECT Date,
       max( CASE code WHEN 'a1' THEN count END ) As a1,
       max( CASE code WHEN 'a2' THEN count END ) As a2,
       max( CASE code WHEN 'a3' THEN count END ) As a3
FROM table
GROUP BY Date
The PIVOT clause is only syntactic sugar that makes it easier to express the query above.
The query below, which uses the PIVOT clause, is equivalent to the one above.
SELECT *
FROM (SELECT date, code, count FROM table )
PIVOT (
max( count ) FOR code IN ( 'a1' as a1, 'a2' as a2, 'a3' as a3 )
)
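Since the number of codes is not known in advance (the point of the question), the CASE-based query has to be generated dynamically from the distinct codes. A minimal sketch of that string construction in Python; the table and column names follow the placeholders above, and fetching the distinct codes and executing the statement are left to whatever Oracle driver you use:
# Sketch: build the CASE-based pivot dynamically from the distinct codes.
# 'codes' would normally come from: SELECT DISTINCT code FROM the source table.
codes = ["a1", "a2", "a3"]   # assumed to be trusted identifiers

case_cols = ",\n       ".join(
    f"max( CASE code WHEN '{c}' THEN count END ) AS {c}" for c in codes
)

pivot_sql = (
    "SELECT Date,\n"
    "       " + case_cols + "\n"
    "FROM table\n"
    "GROUP BY Date"
)
print(pivot_sql)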

Sample a time series by time interval with Hive QL and calculate jumps

I have time series data in a table. Basically each row has a timestamp and a value.
The frequency of the data is completely irregular.
I'd like to resample it at a given frequency and, for each interval, extract the relevant information: min, max, last, change (relative to the previous interval), return (change / previous), and maybe more (count, ...).
So here's my input:
08:00:10, 1
08:01:20, 2
08:01:21, 3
08:01:24, 5
08:02:24, 2
And I'd like to get the following result for 1 minute sampling (ts, min, max, last, change, return):
ts m M L Chg Return
08:01:00, 1, 1, 1, NULL, NULL
08:02:00, 2, 5, 5, 4, 4
08:03:00, 2, 2, 2, -3, -0.25
You could do it with something like this (comments inline):
SELECT
    min
    , mn
    , mx
    , l
    , l - LAG(l, 1) OVER (ORDER BY min) c
    -- This might not be the right calculation. Unsure how -0.25 was derived in the question.
    , (l - LAG(l, 1) OVER (ORDER BY min)) / (LAG(l, 1) OVER (ORDER BY min)) r
FROM
(
    SELECT
        min
        , MIN(val) mn
        , MAX(val) mx
        -- We can take MAX here because all l's (last values) for the minute are the same.
        , MAX(l) l
    FROM
    (
        SELECT
            min
            , val
            -- The last value of the minute, ordered by the timestamp, using all rows.
            , LAST_VALUE(val) OVER (PARTITION BY min ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) l
        FROM
        (
            SELECT
                ts
                -- Label each row with the end of its minute: convert to seconds,
                -- add 60, then format back to a shorter string, dropping the seconds.
                -- 2000-01-01 is a dummy date just to enable the conversion.
                , CONCAT(FROM_UNIXTIME(UNIX_TIMESTAMP(CONCAT("2000-01-01 ", ts), "yyyy-MM-dd HH:mm:ss") + 60, "HH:mm"), ":00") min
                , val
            FROM
                -- As from the question.
                21908430_input a
        ) val_by_min
    ) val_by_min_with_l
    GROUP BY min
) min_with_l_m_M
ORDER BY min
;
Result:
+----------+----+----+---+------+------+
| min      | mn | mx | l | c    | r    |
+----------+----+----+---+------+------+
| 08:01:00 |  1 |  1 | 1 | NULL | NULL |
| 08:02:00 |  2 |  5 | 5 | 4    | 4    |
| 08:03:00 |  2 |  2 | 2 | -3   | -0.6 |
+----------+----+----+---+------+------+
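As a cross-check of the logic (not Hive), the same 1-minute resampling can be expressed with pandas. Like the query above, it labels each interval by the end of its minute and derives change/return from the last value per interval, reproducing the -0.6 rather than the -0.25 from the question:
# Purely illustrative pandas cross-check of the resampling logic.
import pandas as pd

s = pd.Series(
    [1, 2, 3, 5, 2],
    index=pd.to_datetime(["08:00:10", "08:01:20", "08:01:21", "08:01:24", "08:02:24"]),
)

# Label each bucket by the end of its minute, as the Hive query does (ts + 60s, seconds dropped).
out = s.resample("1min", label="right").agg(["min", "max", "last"])
out["change"] = out["last"].diff()         # last - previous last
out["return"] = out["last"].pct_change()   # change / previous last
print(out)
# last values 1, 5, 2  ->  change NaN, 4, -3  and  return NaN, 4.0, -0.6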
