How to realize funnel analysis in ClickHouse

I want to do funnel analysis based on event-tracking ("buried point") data stored in ClickHouse. Let's define a few elements for the funnel analysis:
A series of events: A (event_id = 1) -> B (event_id = 2) -> C (event_id = 3)
Time period: 0 (event_ms) ~ 500 (event_ms)
Time window: 100 (event_ms)
For each user, I want to know whether an event series (A -> B -> C) happened within the time period, with the interval between A and C falling within the time window.
Here is my test dataset:
CREATE TABLE test_dataset
(
    `event_id` UInt64,
    `event_ms` UInt64,
    `uid` UInt64 -- user_id
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMMDD(toDate(event_ms))
ORDER BY (event_id, event_ms, uid)
SETTINGS index_granularity = 8192;
INSERT INTO TABLE test_dataset VALUES
(1, 100, 123),
(1, 120, 123),
(1, 130, 123),
(1, 150, 345),
(1, 180, 345),
(2, 150, 123),
(2, 200, 234),
(2, 140, 345),
(2, 210, 345),
(2, 300, 345),
(3, 180, 123),
(3, 250, 123),
(3, 290, 234),
(3, 270, 345);
I use JOINs to find all qualified event series:
SELECT
    t1.event_ms, t2.event_ms, t3.event_ms, t4.event_ms,
    t1.uid, t2.uid, t3.uid, t4.uid
FROM
(
    SELECT uid, event_ms
    FROM test_dataset
    WHERE (event_id = 1) AND (event_ms >= 0) AND (event_ms <= 500)
) AS t1
ASOF LEFT JOIN
(
    SELECT uid, event_ms
    FROM test_dataset
    WHERE (event_id = 2) AND (event_ms >= 0) AND (event_ms <= 500)
) AS t2 ON (t1.uid = t2.uid) AND (t1.event_ms < t2.event_ms)
ASOF LEFT JOIN
(
    SELECT uid, event_ms
    FROM test_dataset
    WHERE (event_id = 3) AND (event_ms >= 0) AND (event_ms <= 500)
) AS t3 ON (t2.uid = t3.uid) AND (t2.event_ms < t3.event_ms)
ASOF LEFT JOIN
(
    SELECT uid, event_ms
    FROM test_dataset
    WHERE (event_id = 3) AND (event_ms >= 0) AND (event_ms <= 500)
) AS t4 ON (t3.uid = t4.uid) AND (t4.event_ms < t1.event_ms + 100)
WHERE t4.event_ms > 0;
Here are all qualified event series:
┌─t1.event_ms─┬─t2.event_ms─┬─t3.event_ms─┬─t4.event_ms─┬─t1.uid─┬─t2.uid─┬─t3.uid─┬─t4.uid─┐
│         180 │         210 │         270 │         270 │    345 │    345 │    345 │    345 │
│         120 │         150 │         180 │         180 │    123 │    123 │    123 │    123 │
│         130 │         150 │         180 │         180 │    123 │    123 │    123 │    123 │
│         100 │         150 │         180 │         180 │    123 │    123 │    123 │    123 │
└─────────────┴─────────────┴─────────────┴─────────────┴────────┴────────┴────────┴────────┘
Then I know users 123 and 345 have such an event series within the time period. Using JOINs is pretty slow in ClickHouse; is there any other way to work around this problem?
By the way, I don't need to know all qualified series; I only want to know whether there is at least one such event series for each user.

There is a function windowFunnel that searches for a chain of events in a sliding time window.
SELECT
uid,
windowFunnel(100)(event_ms, event_id = 1, event_id = 2, event_id = 3) AS chain_len
FROM test_dataset
WHERE (event_ms > 0) AND (event_ms < 500)
GROUP BY uid;
Result:
┌─uid─┬─chain_len─┐
│ 234 │         0 │
│ 345 │         3 │
│ 123 │         3 │
└─────┴───────────┘
It returns the length of the matched chain, so the value 3 for users 345 and 123 means the whole chain was matched.
If we decrease the window to 10, it finds only the beginning of the chain and does not match further events, because the condition timestamp of event 2 <= timestamp of event 1 + window no longer holds.
SELECT
uid,
windowFunnel(10)(event_ms, event_id = 1, event_id = 2, event_id = 3) AS chain_len
FROM test_dataset
WHERE (event_ms > 0) AND (event_ms < 500)
GROUP BY uid
Result:
┌─uid─┬─chain_len─┐
│ 234 │         0 │
│ 345 │         1 │
│ 123 │         1 │
└─────┴───────────┘
So, to check whether such a chain exists for a user, you can check that windowFunnel matched the appropriate number of events.
The restriction on the time interval (time period: 0 (event_ms) ~ 500 (event_ms)) is simply handled in the WHERE clause.
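For example, to get just the users that completed the whole funnel, a minimal sketch (same test_dataset as above; the aggregate can be checked directly in HAVING):
SELECT uid
FROM test_dataset
WHERE (event_ms >= 0) AND (event_ms <= 500)
GROUP BY uid
HAVING windowFunnel(100)(event_ms, event_id = 1, event_id = 2, event_id = 3) = 3;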
Add more events out of period:
INSERT INTO TABLE test_dataset VALUES (1, 600, 234), (2, 601, 234), (3, 602, 234);
Then check:
SELECT
uid,
windowFunnel(100)(event_ms, event_id = 1, event_id = 2, event_id = 3) AS chain_len
FROM test_dataset
WHERE (event_ms > 0) AND (event_ms < 500)
GROUP BY uid
Result:
┌─uid─┬─chain_len─┐
│ 234 │         0 │
│ 345 │         3 │
│ 123 │         3 │
└─────┴───────────┘
Without the WHERE clause, events outside the period are matched too:
SELECT
uid,
windowFunnel(100)(event_ms, event_id = 1, event_id = 2, event_id = 3) AS chain_len
FROM test_dataset
GROUP BY uid
Result:
┌─uid─┬─chain_len─┐
│ 234 │         3 │
│ 345 │         3 │
│ 123 │         3 │
└─────┴───────────┘

Related

How to check missing values in ClickHouse

I have a table that is filled with data every 15 minutes. I need to check that there is data for all days of the entire period. There is a time column in which the data is in the format yyyy-mm-dd hh:mm:ss.
I've found the start date and the last date with min() and max().
I found out that you can generate an array of dates from this interval (the start and end dates), compare each line against it, and where there is no match, that is the missing date.
I've tried this:
WITH dates_range AS (SELECT toDate(min(time)) AS start_date,
toDate(max(time)) AS end_date
FROM table)
SELECT dates
FROM (
SELECT arrayFlatten(arrayMap(x -> start_date + x, range(0, toUInt64(end_date - start_date)))) AS dates
FROM dates_range
)
LEFT JOIN (
SELECT toDate(time) AS date
FROM table
GROUP BY toDate(time)
) USING date
WHERE date IS NULL;
but it returns Code: 10. DB::Exception: Not found column date in block. There are only columns: dates. (NOT_FOUND_COLUMN_IN_BLOCK) and I can't figure out what's wrong.
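The immediate problems are that the inner query produces a column named dates (an array, never unrolled) while the join expects date, and that a non-Nullable date can never be NULL. A minimal sketch of the fixed query (keeping the question's placeholder table name table; arrayJoin unrolls the generated range, and join_use_nulls = 1 lets the right side be NULL for missing days):
WITH dates_range AS
(
    SELECT
        toDate(min(time)) AS start_date,
        toDate(max(time)) AS end_date
    FROM table
)
SELECT date
FROM
(
    SELECT arrayJoin(arrayMap(x -> start_date + x, range(0, toUInt64((end_date - start_date) + 1)))) AS date
    FROM dates_range
) AS l
LEFT JOIN
(
    SELECT toDate(time) AS date
    FROM table
    GROUP BY toDate(time)
) AS r USING (date)
WHERE r.date IS NULL
SETTINGS join_use_nulls = 1;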
You can also use the WITH FILL modifier (https://clickhouse.com/docs/en/sql-reference/statements/select/order-by/#order-by-expr-with-fill-modifier): ORDER BY ... WITH FILL inserts rows for the missing values of the ordering expression, so absent days show up with a zero count.
create table T ( time DateTime) engine=Memory
as SELECT toDateTime('2020-01-01') + (((number * 60) * 24) * if((number % 33) = 0, 3, 1))
FROM numbers(550);
SELECT *
FROM
(
SELECT
toDate(time) AS t,
count() AS c
FROM T
GROUP BY t
ORDER BY t ASC WITH FILL
)
WHERE c = 0
┌──────────t─┬─c─┐
│ 2020-01-11 │ 0 │
│ 2020-01-13 │ 0 │
│ 2020-01-16 │ 0 │
│ 2020-01-18 │ 0 │
│ 2020-01-21 │ 0 │
│ 2020-01-23 │ 0 │
│ 2020-01-26 │ 0 │
└────────────┴───┘
Another option is to generate the full date range and match it against the data with UNION ALL:
WITH (SELECT (toDate(min(time)), toDate(max(time))) FROM T) AS x
SELECT
    date,
    sumIf(cnt, type = 1) AS c1,
    sumIf(cnt, type = 2) AS c2
FROM
(
    SELECT
        arrayJoin(arrayFlatten(arrayMap(x -> (x.1 + x), range(0, toUInt64((x.2 - x.1) + 1))))) AS date,
        2 AS type,
        1 AS cnt
    UNION ALL
    SELECT
        toDate(time) AS date,
        1 AS type,
        count() AS cnt
    FROM T
    GROUP BY toDate(time)
)
GROUP BY date
HAVING (c1 = 0) OR (c2 = 0);
┌───────date─┬─c1─┬─c2─┐
│ 2020-01-11 │  0 │  1 │
│ 2020-01-13 │  0 │  1 │
│ 2020-01-16 │  0 │  1 │
│ 2020-01-18 │  0 │  1 │
│ 2020-01-21 │  0 │  1 │
│ 2020-01-23 │  0 │  1 │
│ 2020-01-26 │  0 │  1 │
└────────────┴────┴────┘
Or with a LEFT JOIN and join_use_nulls:
WITH (SELECT (toDate(min(time)), toDate(max(time))) FROM T) AS x
SELECT l.*, r.*
FROM
(
    SELECT arrayJoin(arrayFlatten(arrayMap(x -> (x.1 + x), range(0, toUInt64((x.2 - x.1) + 1))))) AS date
) AS l
LEFT JOIN
(
    SELECT toDate(time) AS date
    FROM T
    GROUP BY toDate(time)
) AS r USING (date)
WHERE r.date IS NULL
SETTINGS join_use_nulls = 1;
┌───────date─┬─r.date─┐
│ 2020-01-11 │   ᴺᵁᴸᴸ │
│ 2020-01-13 │   ᴺᵁᴸᴸ │
│ 2020-01-16 │   ᴺᵁᴸᴸ │
│ 2020-01-18 │   ᴺᵁᴸᴸ │
│ 2020-01-21 │   ᴺᵁᴸᴸ │
│ 2020-01-23 │   ᴺᵁᴸᴸ │
│ 2020-01-26 │   ᴺᵁᴸᴸ │
└────────────┴────────┘
Or using a window function to look at the gap between consecutive dates:
SELECT b
FROM
(
    SELECT
        b,
        b - any(b) OVER (ORDER BY b ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS lag
    FROM
    (
        SELECT toDate(time) AS b
        FROM T
        GROUP BY b
        ORDER BY b ASC
    )
)
WHERE (lag > 1) AND (lag < 10000)
┌──────────b─┐
│ 2020-01-12 │
│ 2020-01-14 │
│ 2020-01-17 │
│ 2020-01-19 │
│ 2020-01-22 │
│ 2020-01-24 │
│ 2020-01-27 │
└────────────┘
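Note that this last variant returns the first existing day after each gap, not the missing day itself. To expand each gap into the actual missing dates, a sketch along the same lines (same table T):
SELECT arrayJoin(arrayMap(i -> b - i, range(1, toUInt64(lag)))) AS missing_day
FROM
(
    SELECT
        b,
        b - any(b) OVER (ORDER BY b ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS lag
    FROM
    (
        SELECT toDate(time) AS b
        FROM T
        GROUP BY b
        ORDER BY b ASC
    )
)
WHERE (lag > 1) AND (lag < 10000);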

Join two datasets with key duplicates one by one

I need to join two datasets, e.g. from a left and a right source, matching values by some keys. The datasets can contain duplicates:
┌─key─┬─value──┬─source──┐
│   1 │ val1   │ left    │
│   1 │ val1   │ left    │ << duplicate from left source
│   1 │ val1   │ left    │ << another duplicate from left source
│   1 │ val1   │ right   │
│   1 │ val1   │ right   │ << duplicate from right source
│   2 │ val2   │ left    │
│   2 │ val3   │ right   │
└─────┴────────┴─────────┘
I can't use a FULL JOIN; it gives the Cartesian product of all duplicates.
I am trying to use GROUP BY instead:
select
`key`,
anyIf(value, source = 'left') as left_value,
anyIf(value, source = 'right') as right_value
from test_raw
group by key;
It works well, but is there any way to match the left and right duplicates one by one?
Expected result:
┌─key─┬─left_value─┬─right_value─┐
│   1 │ val1       │ val1        │
│   1 │ val1       │ val1        │
│   1 │ val1       │             │
│   2 │ val2       │ val3        │
└─────┴────────────┴─────────────┘
Scripts to reproduce:
create table test_raw
(`key` Int64,`value` String,`source` String)
ENGINE = Memory;
insert into test_raw (`key`,`value`,`source`)
values
(1, 'val1', 'left'),
(1, 'val1', 'left'),
(1, 'val1', 'left'),
(1, 'val1', 'right'),
(1, 'val1', 'right'),
(2, 'val2', 'left'),
(2, 'val3', 'right');
select
`key`,
anyIf(value, source = 'left') as left_value,
anyIf(value, source = 'right') as right_value
from test_raw
group by key;
The idea: group each side into a sorted array per key, zip the two arrays by index (the shorter side is padded with empty strings), and unfold back into rows with ARRAY JOIN:
SELECT
key,
left_value,
right_value
FROM
(
SELECT
key,
arraySort(groupArrayIf(value, source = 'left')) AS l,
arraySort(groupArrayIf(value, source = 'right')) AS r,
arrayMap(i -> (l[i + 1], r[i + 1]), range(greatest(length(l), length(r)))) AS t
FROM test_raw
GROUP BY key
)
ARRAY JOIN
t.1 AS left_value,
t.2 AS right_value
ORDER BY key ASC
┌─key─┬─left_value─┬─right_value─┐
│   1 │ val1       │ val1        │
│   1 │ val1       │ val1        │
│   1 │ val1       │             │
│   2 │ val2       │ val3        │
└─────┴────────────┴─────────────┘

ClickHouse - How can I get distinct values from all values inside an array type column

On a ClickHouse database, I have an Array-typed column and I want to get the distinct values across all elements of all the arrays.
Instead of getting this:
SELECT DISTINCT errors.message_grouping_fingerprint
FROM views
WHERE (session_date >= toDate('2022-07-21')) AND (session_date < toDate('2022-07-22'))
    AND notEmpty(errors.message) = 1
    AND project_id = 162
SETTINGS distributed_group_by_no_merge = 0
[-8964675922652096680,-8964675922652096680]
[-8964675922652096680]
[-8964675922652096680,-8964675922652096680,-8964675922652096680,-8964675922652096680,-8964675922652096680,-8964675922652096680,-8964675922652096680,-827009490898812590,-8964675922652096680,-8964675922652096680,-8964675922652096680,-8964675922652096680]
[-8964675922652096680,-8964675922652096680,-8964675922652096680]
[-827009490898812590]
[-1660275624223727714,-1660275624223727714]
[1852265010681444046]
[-2552644061611887546]
[-7142229185866234523]
[-7142229185866234523,-7142229185866234523]
I want to get this:
-8964675922652096680
-827009490898812590
-1660275624223727714
1852265010681444046
-2552644061611887546
-7142229185866234523
and finally, to count them all: 6.
Use groupUniqArrayArray:
SELECT arrayMap(i -> rand() % 10, range((rand() % 3) + 1)) AS arr
FROM numbers(10);
┌─arr─────┐
│ [0]     │
│ [1]     │
│ [7,7,7] │
│ [8,8]   │
│ [9,9,9] │
│ [6,6,6] │
│ [2,2]   │
│ [8,8,8] │
│ [2]     │
│ [8,8,8] │
└─────────┘
SELECT
groupUniqArrayArray(arr) AS uarr,
length(uarr)
FROM
(
SELECT arrayMap(i -> (rand() % 10), range((rand() % 3) + 1)) AS arr
FROM numbers(10)
)
┌─uarr──────────────┬─length(groupUniqArrayArray(arr))─┐
│ [0,5,9,4,2,8,7,3] │                                8 │
└───────────────────┴──────────────────────────────────┘
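Applied back to the original query, a sketch (assuming the same views table and filters as in the question):
SELECT
    groupUniqArrayArray(errors.message_grouping_fingerprint) AS fingerprints,
    length(fingerprints) AS cnt
FROM views
WHERE (session_date >= toDate('2022-07-21')) AND (session_date < toDate('2022-07-22'))
    AND notEmpty(errors.message) = 1
    AND project_id = 162;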
Alternatively, with ARRAY JOIN:
SELECT A
FROM
(
SELECT arrayMap(i -> (rand() % 10), range((rand() % 3) + 1)) AS arr
FROM numbers(10)
)
ARRAY JOIN arr AS A
GROUP BY A
┌─A─┐
│ 0 │
│ 1 │
│ 4 │
│ 5 │
│ 6 │
│ 9 │
└───┘
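To get only the count with this variant, aggregate instead of grouping; a minimal sketch:
SELECT uniqExact(A) AS cnt
FROM
(
    SELECT arrayMap(i -> rand() % 10, range((rand() % 3) + 1)) AS arr
    FROM numbers(10)
)
ARRAY JOIN arr AS A;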

How to rewrite this deprecated expression using do and "by", with "groupby" (Julia)

The goal is to generate fake data.
We generate a set of parameters,
## Simulated data
df_3 = DataFrame(y = [0,1], size = [250,250], x1 =[2.,0.], x2 =[-1.,-2.])
Now, I want to generate the fake data per se,
df_knn = by(df_3, :y) do df
    DataFrame(x_1 = rand(Normal(df[1, :x1], 1), df[1, :size]),
              x_2 = rand(Normal(df[1, :x2], 1), df[1, :size]))
end
How can I replace by with groupby here?
SOURCE: This excerpt is from the book, Data Science with Julia (2019).
I think this is what you mean here:
julia> combine(groupby(df_3, :y)) do df
DataFrame(x_1 = rand(Normal(df[1,:x1],1), df[1,:size]),
x_2 = rand(Normal(df[1,:x2],1), df[1,:size]))
end
500×3 DataFrame
Row │ y x_1 x_2
│ Int64 Float64 Float64
─────┼─────────────────────────────
1 │ 0 1.88483 0.890807
2 │ 0 2.50124 -0.280708
3 │ 0 1.1857 0.823002
⋮ │ ⋮ ⋮ ⋮
498 │ 1 -0.611168 -0.856527
499 │ 1 0.491412 -3.09562
500 │ 1 0.242016 -1.42652
494 rows omitted

ClickHouse - how do I do a natural sort query with limit?

I want my SELECT queries to be able to do natural sorting, using these concepts: https://rosettacode.org/wiki/Natural_sorting
You can play with collation settings as in the queries below.
Take into account that ClickHouse has a collation bug (#7482) and fails for some languages, such as en and de.
SELECT arrayJoin(['kk 50', 'KK 01', ' KK 2', ' KK 3', 'kk 1', 'x9y99', 'x9y100']) AS item
ORDER BY item ASC
/*
Result:
┌─item───┐
│  KK 2  │
│  KK 3  │
│ KK 01  │
│ kk 1   │
│ kk 50  │
│ x9y100 │
│ x9y99  │
└────────┘
*/
SELECT arrayJoin(['kk 50', 'KK 01', ' KK 2', ' KK 3', 'kk 1', 'x9y99', 'x9y100']) AS item
ORDER BY item ASC COLLATE 'tr-u-kn-true-ka-shifted'
/*
Result:
┌─item───┐
│ kk 1   │
│ KK 01  │
│  KK 2  │
│  KK 3  │
│ kk 50  │
│ x9y99  │
│ x9y100 │
└────────┘
*/
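The title also asks about LIMIT; it composes with the collated ORDER BY as usual, e.g. to take the first three items of the same data:
SELECT arrayJoin(['kk 50', 'KK 01', ' KK 2', ' KK 3', 'kk 1', 'x9y99', 'x9y100']) AS item
ORDER BY item ASC COLLATE 'tr-u-kn-true-ka-shifted'
LIMIT 3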
