I have time series data in a table. Basically each row has a timestamp and a value.
The frequency of the data is absolutely random.
I'd like to sample it with a given frequency and, for each sampling interval, extract relevant information about it: min, max, last, change (relative to the previous interval), return (change / previous) and maybe more (count...)
So here's my input:
08:00:10, 1
08:01:20, 2
08:01:21, 3
08:01:24, 5
08:02:24, 2
And I'd like to get the following result for 1 minute sampling (ts, min, max, last, change, return):
ts m M L Chg Return
08:01:00, 1, 1, 1, NULL, NULL
08:02:00, 2, 5, 5, 4, 4
08:03:00, 2, 2, 2, -3, -0.25
You could do it with something like this (comments inline):
SELECT
min
, mn
, mx
, l
, l - LAG(l, 1) OVER (ORDER BY min) c
-- Return = change / previous last value, as defined in the question.
-- For the last row this gives -3 / 5 = -0.6; unsure how -0.25 was derived in the question.
, (l - LAG(l, 1) OVER (ORDER BY min)) / (LAG(l, 1) OVER (ORDER BY min)) r
FROM
(
SELECT
min
, MIN(val) mn
, MAX(val) mx
-- We can take MAX here because all l's (last values) for the minute are the same.
, MAX(l) l
FROM
(
SELECT
min
, val
-- The last value of the minute, ordered by the timestamp, using all rows.
, LAST_VALUE(val) OVER (PARTITION BY min ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) l
FROM
(
SELECT
ts
-- Drop the seconds and move forward one minute (to the end of the interval) by
-- converting to seconds, adding 60, and then going back to a shorter string format.
-- 2000-01-01 is a dummy date just to enable the conversion.
, CONCAT(FROM_UNIXTIME(UNIX_TIMESTAMP(CONCAT("2000-01-01 ", ts), "yyyy-MM-dd HH:mm:ss") + 60, "HH:mm"), ":00") min
, val
FROM
-- As from the question.
21908430_input a
) val_by_min
) val_by_min_with_l
GROUP BY min
) min_with_l_m_M
ORDER BY min
;
Result:
+----------+----+----+---+------+------+
| min | mn | mx | l | c | r |
+----------+----+----+---+------+------+
| 08:01:00 | 1 | 1 | 1 | NULL | NULL |
| 08:02:00 | 2 | 5 | 5 | 4 | 4 |
| 08:03:00 | 2 | 2 | 2 | -3 | -0.6 |
+----------+----+----+---+------+------+
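For readers who want to check the numbers outside of SQL, here is a minimal pure-Python sketch of the same bucketing idea. The function name `resample` and its `step_seconds` parameter are made up for illustration, and the sketch assumes all timestamps fall on the same day. Like the query, it labels each interval by its end time and computes the return as change / previous last value.

```python
from datetime import datetime

def resample(rows, step_seconds=60):
    """rows: iterable of ("HH:MM:SS", value). Returns one tuple per interval:
    (interval_end, min, max, last, change, return)."""
    buckets = {}
    for ts, val in rows:
        t = datetime.strptime(ts, "%H:%M:%S")
        secs = t.hour * 3600 + t.minute * 60 + t.second
        # Label the row with the end of the interval it falls into.
        end = (secs // step_seconds + 1) * step_seconds
        buckets.setdefault(end, []).append((secs, val))

    out = []
    prev_last = None
    for end in sorted(buckets):
        # Stable sort by time so "last" is the latest value in the interval.
        items = sorted(buckets[end], key=lambda p: p[0])
        vals = [v for _, v in items]
        last = vals[-1]
        change = None if prev_last is None else last - prev_last
        ret = None if prev_last in (None, 0) else change / prev_last
        label = f"{end // 3600:02d}:{end % 3600 // 60:02d}:{end % 60:02d}"
        out.append((label, min(vals), max(vals), last, change, ret))
        prev_last = last
    return out

data = [("08:00:10", 1), ("08:01:20", 2), ("08:01:21", 3),
        ("08:01:24", 5), ("08:02:24", 2)]
for row in resample(data):
    print(row)
```

Note that empty intervals simply do not appear in the output, just as in the query; change and return are then computed against the previous non-empty interval.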
I am trying to replicate this Excel formula or for loop that calculates the current row value based on the previous row value of the same column.
When value = 1, Result = Rank * 1
Else, Result = Rank * previous Result
| Value | Rank | Result |
| ----- | ---- | ------ |
| 1 | 3 | 3 |
| 2 | 2 | 6 |
| 3 | 1 | 6 |
I already tried generating a series table to get the Value and Rank columns but I am unable to refer to the existing Result column to update the same. Is there a way to get this done using DAX for dynamic number of rows?
It can be done, but you need to stop thinking in terms of recursion. Notice that I return [Rank] when [Value] equals 1, because [Rank] * 1 is simply [Rank].
Result =
IF (
Table[Value] = 1,
[Rank],
[Rank] *
PRODUCTX (
FILTER ( Table, Table[Value] < EARLIER ( Table[Value] ) ),
[Rank]
)
)
EDIT:
The previous one was unnecessarily complex:
Result =
CALCULATE (
PRODUCT ( [Rank] ),
FILTER ( Table, [Value] <= EARLIER ( [Value] ) )
)
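To see why no recursion is needed, here is a small Python sketch (not DAX) of the same idea: each Result is the product of the Ranks of all rows whose Value is less than or equal to the current Value, which is a running product. The function name `running_product` is made up for illustration, and the sketch assumes the Value column is unique.

```python
from itertools import accumulate
from operator import mul

def running_product(rows):
    """rows: list of (value, rank). Returns (value, rank, result) tuples,
    where result is the product of the ranks up to and including this row."""
    ordered = sorted(rows)                # order by Value
    ranks = [rank for _, rank in ordered]
    results = accumulate(ranks, mul)      # cumulative product of Rank
    return [(v, r, res) for (v, r), res in zip(ordered, results)]

table = [(1, 3), (2, 2), (3, 1)]          # (Value, Rank) from the question
for row in running_product(table):
    print(row)
```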
I'm trying to reduce a big dataset to rows having minimum and maximum values for each column. In other words, I would like, for every column of this dataset to get one row that has the minimum value on that column, as well as another that has the maximum value on the same column. I should mention that I do not know in advance what columns this dataset will have. Here's an example:
+----+----+----+ +----+----+----+
|Col1|Col2|Col3| ==> |Col1|Col2|Col3|
+----+----+----+ +----+----+----+
| F | 99 | 17 | | A | 34 | 25 |
| M | 32 | 20 | | Z | 51 | 49 |
| D | 2 | 84 | | D | 2 | 84 |
| H | 67 | 90 | | F | 99 | 17 |
| P | 54 | 75 | | C | 18 | 9 |
| C | 18 | 9 | | H | 67 | 90 |
| Z | 51 | 49 | +----+----+----+
| A | 34 | 25 |
+----+----+----+
The first row is selected because A is the smallest value on Col1. The second because Z is the largest value on Col1. The third because 2 is the smallest on Col2, and so on. The code below seems to do the right thing (correct me if I'm wrong), but performance is sloooow. I start with getting a dataframe from a random .csv file:
input_file = (sqlContext.read
.format("csv")
.options(header="true", inferSchema="true", delimiter=";", charset="UTF-8")
.load("/FileStore/tables/random.csv")
)
Then I create two other dataframes that each have one row with the min and respectively, max values of each column:
from pyspark.sql.functions import col, min, max
min_values = input_file.select(
*[min(col(col_name)).name(col_name) for col_name in input_file.columns]
)
max_values = input_file.select(
*[max(col(col_name)).name(col_name) for col_name in input_file.columns]
)
Finally, I repeatedly join the original input file to these two dataframes holding minimum and maximum values, using every column in turn, and do a union between all the results.
min_max_rows = (
input_file
.join(min_values, input_file[input_file.columns[0]] == min_values[input_file.columns[0]])
.select(input_file["*"]).limit(1)
.union(
input_file
.join(max_values, input_file[input_file.columns[0]] == max_values[input_file.columns[0]])
.select(input_file["*"]).limit(1)
)
)
for c in input_file.columns[1:]:
min_max_rows = min_max_rows.union(
input_file
.join(min_values, input_file[c] == min_values[c])
.select(input_file["*"]).limit(1)
.union(
input_file
.join(max_values, input_file[c] == max_values[c])
.select(input_file["*"]).limit(1)
)
)
min_max_rows.dropDuplicates()
For my test dataset of 500k rows, 40 columns, doing all this takes about 7-8 minutes on a standard Databricks cluster. I'm supposed to sift through more than 20 times this amount of data regularly. Is there any way to optimize this code? I'm quite afraid I've taken the naive approach to it, since I'm quite new to Spark.
Thanks!
This does not seem to be a popular question, but it is interesting (to me), and a lot of work for 15 pts. In fact, I got it wrong the first time round.
Here is a scalable solution that you can partition accordingly to increase throughput.
It is hard to explain; manipulating and transposing the data is the key issue here, along with some lateral thinking.
I did not focus on variable columns with all sorts of data types. You will need to solve that yourself; it can be done, but it requires some if/else logic to check whether a column is alpha, double or numeric. Mixing data types gets problematic, but it can be solved. I gave a notion of num_string, but did not complete that.
I have focused on the scalability issue and approach, with less procedural logic. The sample is smaller and all numeric, but it is correct as far as I can see. The general principle is there.
Try it. Success.
Code:
from pyspark.sql.functions import *
from pyspark.sql.types import *
def reshape(df, by):
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
kvs = explode(array([
struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
])).alias("kvs")
return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df1 = spark.createDataFrame(
[(4, 15, 3), (200, 100, 25), (7, 16, 4)], ("c1", "c2", "c3"))
df1 = df1.withColumn("rowId", monotonically_increasing_id())
df1.cache()
df1.show()
df2 = reshape(df1, ["rowId"])
df2.show()
# In case you have other types like characters in the other column - not focusing on that aspect
df3 = df2.withColumn("num_string", format_string("%09d", col("val")))
# Avoid column name issues.
df3 = df3.withColumn("key1", col("key"))
df3.show()
df3 = df3.groupby('key1').agg(min(col("val")).alias("min_val"), max(col("val")).alias("max_val"))
df3.show()
df4 = df3.join(df2, df3.key1 == df2.key)
new_column_condition = expr(
"""IF(val = min_val, -1, IF(val = max_val, 1, 0))"""
)
df4 = df4.withColumn("col_class", new_column_condition)
df4.show()
df5 = df4.filter( '(min_val = val or max_val = val) and col_class <> 0' )
df5.show()
df6 = df5.join(df1, df5.rowId == df1.rowId)
df6.show()
df6.select([c for c in df6.columns if c in ['c1','c2', 'c3']]).distinct().show()
Returns:
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 4| 15| 3|
|200|100| 25|
+---+---+---+
Data wrangling is the key here.
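The reshaping logic may be easier to follow without Spark. Below is a plain-Python sketch of the same steps on the same sample rows: unpivot to (rowId, key, val), find the min and max per key, and keep the rows that hold one of them. The function name `min_max_rows` is made up for illustration. Unlike the limit(1) approach in the question, ties keep every tied row.

```python
def min_max_rows(rows, columns):
    """Return the rows that hold the minimum or maximum of at least one column."""
    # Unpivot: one (row_id, column_name, value) triple per cell.
    long = [(i, c, row[j]) for i, row in enumerate(rows)
            for j, c in enumerate(columns)]

    # Per-column minimum and maximum.
    lo, hi = {}, {}
    for _, key, val in long:
        lo[key] = val if key not in lo else min(lo[key], val)
        hi[key] = val if key not in hi else max(hi[key], val)

    # Row ids whose value equals the min or max of its column.
    keep = {i for i, key, val in long if val == lo[key] or val == hi[key]}
    return [rows[i] for i in sorted(keep)]

sample = [(4, 15, 3), (200, 100, 25), (7, 16, 4)]
print(min_max_rows(sample, ("c1", "c2", "c3")))
```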
I'm trying to add an aggregate function column to an existing result set. I've tried variations of OVER(), UNION, but cannot find a solution.
Example current result set:
ID ATTR VALUE
1 score 5
1 score 7
1 score 9
Example desired result set:
ID ATTR VALUE STDDEV (score)
1 score 5 2
1 score 7 2
1 score 9 2
Thank you
Seems like you're after:
stddev(value) over (partition by attr)
stddev(value) over (partition by id, attr)
It just depends on what you need to partition by. Based on the sample data, attr should be enough, but I could see it possibly being both ID and attr.
Example:
With CTE (ID, Attr, Value) as (
SELECT 1, 'score', 5 from dual union all
SELECT 1, 'score', 7 from dual union all
SELECT 1, 'score', 9 from dual union all
SELECT 1, 'Z', 1 from dual union all
SELECT 1, 'Z', 5 from dual union all
SELECT 1, 'Z', 8 from dual)
SELECT A.*, stddev(value) over (partition by attr)
FROM cte A
ORDER BY attr, value
DOCS show that by adding an order by to the analytic, one can acquire the cumulative standard deviation per record.
Giving us:
+----+-------+-------+------------------------------------------+
| ID | attr | value | stdev |
+----+-------+-------+------------------------------------------+
| 1 | Z | 1 | 3.51188458428424628280046822063322249225 |
| 1 | Z | 5 | 3.51188458428424628280046822063322249225 |
| 1 | Z | 8 | 3.51188458428424628280046822063322249225 |
| 1 | score | 5 | 2 |
| 1 | score | 7 | 2 |
| 1 | score | 9 | 2 |
+----+-------+-------+------------------------------------------+
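As a cross-check of the figures above, here is a small Python sketch that attaches the per-partition sample standard deviation to every row, which is what the analytic form without an ORDER BY does. The function name `with_partition_stddev` is made up for illustration; returning 0 for a single-row partition is a choice made for this sketch.

```python
from collections import defaultdict
from statistics import stdev

def with_partition_stddev(rows):
    """rows: list of (id, attr, value). Appends the sample standard
    deviation of value within each attr partition to every row."""
    groups = defaultdict(list)
    for _id, attr, value in rows:
        groups[attr].append(value)
    # statistics.stdev needs at least two points; use 0 for a single row.
    spread = {attr: (stdev(vals) if len(vals) > 1 else 0)
              for attr, vals in groups.items()}
    return [(i, a, v, spread[a]) for i, a, v in rows]

rows = [(1, "score", 5), (1, "score", 7), (1, "score", 9),
        (1, "Z", 1), (1, "Z", 5), (1, "Z", 8)]
for row in with_partition_stddev(rows):
    print(row)
```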
Column VAL is a number list from 1 to 3, the other columns are supposed to show:
A) MIN of all lower values than VAL
B) MAX of all lower values than VAL
C) MIN of all greater values than VAL
D) MAX of all greater values than VAL
I would expect this result:
V A B C D
-------------------
1 | | | 2 | 3
2 | 1 | 1 | 3 | 3
3 | 1 | 2 | |
But the result I get is:
V A B C D
-------------------
1 | | | 2 | 3
2 | | | |
3 | | | |
(*) All blank cells are NULL results
The query I wrote:
WITH T AS
(SELECT CAST(LEVEL AS NUMBER) val
FROM DUAL
CONNECT BY LEVEL < 4)
SELECT val
,MIN(val) OVER(ORDER BY val RANGE BETWEEN UNBOUNDED PRECEDING AND val PRECEDING) A --MIN_PRECEDING
,MAX(val) OVER(ORDER BY val RANGE BETWEEN UNBOUNDED PRECEDING AND val PRECEDING) B --MAX_PRECEDING
,MIN(val) OVER(ORDER BY val RANGE BETWEEN val FOLLOWING AND UNBOUNDED FOLLOWING) C --MIN_FOLLOWING
,MAX(val) OVER(ORDER BY val RANGE BETWEEN val FOLLOWING AND UNBOUNDED FOLLOWING) D --MAX_FOLLOWING
FROM T
WHERE val IS NOT NULL
ORDER BY 1
/
Does anybody see what's wrong with this query?
Thanks in advance!
The error is in val PRECEDING and val FOLLOWING. It should be 1 PRECEDING and 1 FOLLOWING.
The number you specify there is an offset relative to the current record (the record corresponding to val in the given window order), so if you specify val there you are going back (or ahead) too far. You need the min/max up to one step before (or from one step after) the current record.
So:
WITH T AS
(SELECT CAST(LEVEL AS NUMBER) val
FROM DUAL
CONNECT BY LEVEL < 4)
SELECT val
,MIN(val) OVER(ORDER BY val RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) A
,MAX(val) OVER(ORDER BY val RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) B
,MIN(val) OVER(ORDER BY val RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) C
,MAX(val) OVER(ORDER BY val RANGE BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) D
FROM T
WHERE val IS NOT NULL
ORDER BY 1
/
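To make the difference concrete, here is a small Python sketch that imitates the two frames on the values 1 to 3. It assumes the RANGE offset is applied to the current row's val, so "n PRECEDING" ends the frame at val - n and "n FOLLOWING" starts it at val + n. The helper names are made up for illustration.

```python
def preceding(values, current, offset):
    """Rows in RANGE BETWEEN UNBOUNDED PRECEDING AND <offset> PRECEDING."""
    return [v for v in values if v <= current - offset]

def following(values, current, offset):
    """Rows in RANGE BETWEEN <offset> FOLLOWING AND UNBOUNDED FOLLOWING."""
    return [v for v in values if v >= current + offset]

def min_max_table(values, offset_for):
    """Build (val, A, B, C, D) rows; offset_for maps val to the frame offset."""
    table = []
    for v in sorted(values):
        off = offset_for(v)
        before = preceding(values, v, off)
        after = following(values, v, off)
        table.append((v,
                      min(before, default=None), max(before, default=None),
                      min(after, default=None), max(after, default=None)))
    return table

buggy = min_max_table([1, 2, 3], lambda v: v)   # "val PRECEDING / val FOLLOWING"
fixed = min_max_table([1, 2, 3], lambda v: 1)   # "1 PRECEDING / 1 FOLLOWING"
print(buggy)
print(fixed)
```

With the offset set to val, the row for val = 2 looks for values <= 0 and >= 4, which is why everything except the first row comes back NULL.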
Please find below the rule of calculating the check digit. The customer number will be 8 digit number. The first 7 digits will be a series and the 8th digit will be a check digit as demonstrated below:
Add all the odd position digits of the running number
Add all the even position digits of the running number
Multiply the above two sums
Take the modulo 10 of the product
So as per the above rule the first account number will be:
First part : 0030001
Sum of odd position digits : 0+3+0+1=4
Sum of even position digits : 0+0+0=0
Product of above two sums : 4x0 = 0
Modulo 10 of above product = 0
The complete account number = 00300010
Second account number:
First part : 0030002
Sum of odd position digits : 0+3+0+2=5
Sum of even position digits : 0+0+0=0
Product of above two sums : 5x0 = 0
Modulo 10 of above product = 0
The complete account number = 00300020
Also, the following must be maintained in the account number format:
Each account number has to be 8 digits long (seven digit running number and one check digit, the last digit)
Leading zeros have to be there
The running number must start from 30001
Taking the oracle tag as a hint (and not yet allowed to comment)…
WITH
Input (str, note) AS (
SELECT '00300010', 'OK' FROM DUAL UNION ALL
SELECT '00300020', 'OK' FROM DUAL UNION ALL
SELECT '00300025', 'check digit wrong' FROM DUAL UNION ALL
SELECT '00200010', '< 30001' FROM DUAL UNION ALL
SELECT '0300010', 'too short' FROM DUAL UNION ALL
SELECT '000200010', 'too long' FROM DUAL UNION ALL
SELECT 'a0300025', 'not a number' FROM DUAL
)
SELECT
str
, note
, CASE
WHEN REGEXP_REPLACE(str, '\d{8}', '', 1, 1, '') IS NULL AND
TO_NUMBER(SUBSTR(str, 1, 7)) > 30000 AND
MOD((TO_NUMBER(SUBSTR(str, 1, 1)) +
TO_NUMBER(SUBSTR(str, 3, 1)) +
TO_NUMBER(SUBSTR(str, 5, 1)) +
TO_NUMBER(SUBSTR(str, 7, 1)))
*
(TO_NUMBER(SUBSTR(str, 2, 1)) +
TO_NUMBER(SUBSTR(str, 4, 1)) +
TO_NUMBER(SUBSTR(str, 6, 1))),
10)
= TO_NUMBER(SUBSTR(str, 8, 1))
THEN 'TRUE'
ELSE 'FALSE'
END valid
FROM Input
;
returns
| STR | NOTE | VALID |
|-----------|-------------------|-------|
| 00300010 | OK | TRUE |
| 00300020 | OK | TRUE |
| 00300025 | check digit wrong | FALSE |
| 00200010 | < 30001 | FALSE |
| 0300010 | too short | FALSE |
| 000200010 | too long | FALSE |
| a0300025 | not a number | FALSE |
SQL Fiddle
The query uses SUBSTR(str, n, 1) to get the digit at each position n…
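The same rule can be sketched in Python, both for validating an account number and for generating one from a running number. The function names `check_digit`, `is_valid` and `make_account` are made up for illustration; the rule itself (sum of odd positions times sum of even positions, modulo 10) is taken from the question.

```python
def check_digit(running: str) -> int:
    """Check digit for a 7-digit running number given as a string."""
    digits = [int(ch) for ch in running]
    odd_sum = sum(digits[0::2])    # positions 1, 3, 5, 7
    even_sum = sum(digits[1::2])   # positions 2, 4, 6
    return (odd_sum * even_sum) % 10

def is_valid(account: str) -> bool:
    """True if account is 8 digits, its running number is at least 30001,
    and the last digit matches the computed check digit."""
    if len(account) != 8 or not (account.isascii() and account.isdigit()):
        return False
    running = account[:7]
    return int(running) > 30000 and check_digit(running) == int(account[7])

def make_account(n: int) -> str:
    """Build the full account number, with leading zeros, from running number n."""
    running = f"{n:07d}"
    return running + str(check_digit(running))

for s in ["00300010", "00300020", "00300025", "00200010",
          "0300010", "000200010", "a0300025"]:
    print(s, is_valid(s))
```

For example, make_account(30001) gives 00300010, matching the first account number in the question.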