Multiple arrays Clickhouse - clickhouse
Problem:
Count the distinct values in an array, filtered by another array on the same row (and then aggregate at a higher level).
Explanation:
Using this data:
In size D70, there are 5 pcs available (hqsize), but the shops request 15 in total. Using the column accumulatedNeed, the first 5 stores in the column shops should receive items (since every store requests 1 pc). That is [4098,4101,4109,4076,4080].
The values in accumulatedNeed could also be [1,4,5,5,5,...,15], where shop 1 requests 1 pc, shop 2 requests 3 pcs, etc. Then only 3 stores would receive items.
In size E75 there is enough stock, so every shop will receive items (10 shops):
Now I want the distinct list of shops from D70 & E75, which would be the final result:
[4098,4101,4109,4076,4080,4062,4063,4067,4072,4075,4056,4058,4059,4061] (14 unique stores) (4109 is only counted once)
Wanted result:
[4098,4101,4109,4076,4080,4062,4063,4067,4072,4075,4056,4058,4059,4061]. (14 unique stores)
I'm totally open to structuring the data differently if that works better.
The reason it can't be precalculated is that the result depends on which shops are filtered on.
Additional issue
The answer below from Vdimir is good and I've used it as the basis for the final solution, but it does not cover partial fulfillment.
If the stock number appears in the runningNeed array we are all good, but remainders are not handled.
If you got:
select 5 as stock,[2,2,3,3] as need, [1,2,3,4] as shops, arrayCumSum(need) as runningNeed,arrayMap(x -> (x <= stock), runningNeed) as mask
You will get:
This is not correct since the 3rd shop should have 1 from stock (5-2-2 = 1)
I can't seem to get my head around how to make an array with "stock given", which in this case would be [2,2,1,0]
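One way to think about the "stock given" array is to cap each shop's need by the stock that is still left before that shop is served: given = max(0, min(need, stock - (runningNeed - need))). The Python sketch below only illustrates that arithmetic; the function name stock_given is made up for this example. In ClickHouse the same expression could presumably be written with arrayMap over need and runningNeed together with least and greatest, but that variant is an assumption and has not been run here.

```python
from itertools import accumulate

def stock_given(stock, need):
    """Return how many pieces each shop receives, serving shops in order."""
    running_need = list(accumulate(need))  # same idea as arrayCumSum(need)
    given = []
    for n, r in zip(need, running_need):
        left_before = stock - (r - n)  # stock remaining before this shop is served
        given.append(max(0, min(n, left_before)))
    return given

print(stock_given(5, [2, 2, 3, 3]))  # [2, 2, 1, 0]
```

A shop counts as (at least partially) served when its entry is greater than 0, so the third shop is included here even though its full need of 3 is not met.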
I used this query to create a table with data similar to your screenshot:
CREATE TABLE t
(
    Size String,
    hqsize Int,
    accumulatedNeed Array(Int),
    shops Array(Int)
) engine = Memory;
INSERT INTO t VALUES ('D70', 5, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], [4098,4101,4109,4076,4080,4083,4062,4063,4067,4072,4075,4056,4057,4058,4059]),('E75', 43, [1,2,3,4,5,6,7,8,9,10], [4109,4062,4063,4067,4072,4075,4056,4058,4059,4061]);
Find which shops can receive enough items:
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask FROM t;
┌─mask────────────────────────────┐
│ [1,1,1,1,1,0,0,0,0,0,0,0,0,0,0] │
│ [1,1,1,1,1,1,1,1,1,1]           │
└─────────────────────────────────┘
Filter out the unfulfilled shops according to this mask:
Note that shops and accumulatedNeed must have the same length.
SELECT arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops, arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask FROM t;
┌─fulfilled_shops─────────────────────────────────────┬─mask────────────────────────────┐
│ [4098,4101,4109,4076,4080]                          │ [1,1,1,1,1,0,0,0,0,0,0,0,0,0,0] │
│ [4109,4062,4063,4067,4072,4075,4056,4058,4059,4061] │ [1,1,1,1,1,1,1,1,1,1]           │
└─────────────────────────────────────────────────────┴─────────────────────────────────┘
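For readers who want to check the mask-and-filter logic outside the database, here is a minimal Python sketch of the same two steps. It is only an illustration: the function name fulfilled_shops and the explicit length check are assumptions of this sketch, not part of the ClickHouse query.

```python
def fulfilled_shops(hqsize, accumulated_need, shops):
    """Keep the shops whose accumulated need still fits into the available stock."""
    if len(accumulated_need) != len(shops):
        raise ValueError("shops and accumulatedNeed must have the same length")
    # Same role as arrayMap(x -> (x <= hqsize), accumulatedNeed)
    mask = [need <= hqsize for need in accumulated_need]
    # Same role as arrayFilter((x, y) -> y, shops, mask)
    return [shop for shop, ok in zip(shops, mask) if ok]

d70_shops = [4098, 4101, 4109, 4076, 4080, 4083, 4062, 4063,
             4067, 4072, 4075, 4056, 4057, 4058, 4059]
print(fulfilled_shops(5, list(range(1, 16)), d70_shops))
# [4098, 4101, 4109, 4076, 4080]
```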
Then you can get a table with all distinct shops:
SELECT DISTINCT arrayJoin(fulfilled_shops) as shops FROM (
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask, arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops FROM t
);
┌─shops─┐
│  4098 │
│  4101 │
│  4109 │
│  4076 │
│  4080 │
│  4062 │
│  4063 │
│  4067 │
│  4072 │
│  4075 │
│  4056 │
│  4058 │
│  4059 │
│  4061 │
└───────┘
14 rows in set. Elapsed: 0.049 sec.
Or, if you need a single array, group it back:
SELECT groupArrayDistinct(arrayJoin(fulfilled_shops)) as shops FROM (
SELECT arrayMap(x -> (x <= hqsize), accumulatedNeed) as mask, arrayFilter((x,y) -> y, shops, mask) as fulfilled_shops FROM t
);
┌─shops───────────────────────────────────────────────────────────────────┐
│ [4080,4076,4101,4075,4056,4061,4062,4063,4109,4058,4067,4059,4072,4098] │
└─────────────────────────────────────────────────────────────────────────┘
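The flatten-then-deduplicate step can be sketched in Python as follows. This is an illustration only, and distinct_shops is a made-up name. Unlike the groupArrayDistinct output above, whose element order differs from the input order, this sketch keeps the first-seen order.

```python
def distinct_shops(fulfilled_per_size):
    """Flatten the per-size lists and keep each shop once, in first-seen order."""
    # Flattening plays the role of arrayJoin(fulfilled_shops)
    flattened = [shop for shops in fulfilled_per_size for shop in shops]
    # dict.fromkeys drops duplicates while preserving insertion order
    return list(dict.fromkeys(flattened))

fulfilled = {
    "D70": [4098, 4101, 4109, 4076, 4080],
    "E75": [4109, 4062, 4063, 4067, 4072, 4075, 4056, 4058, 4059, 4061],
}
result = distinct_shops(fulfilled.values())
print(result)
print(len(result))  # 14, because shop 4109 is counted only once
```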
If you need data only from D70 & E75, you can filter out the other rows with WHERE beforehand.
Related
Idea for algorithm to arrange balls of different weights into boxes
The problem
Suppose I have some balls and box types:
Each ball has a different weight.
Each type of box has its min and max capacity, and a penalty when used.
There are unlimited number of boxes for each type.
How can I arrange the balls into the least boxes such that:
1. The total weight of the balls in each box is within its min and max capacity.
2. The total penalty of the used boxes is minimized.
3. There may be multiple solutions. However, the accepted solution is where the total weight of the balls in each box is nearest to its max capacity.
Example
For example, there are 5 balls of weight 31, 14, 13, 12, 7 respectively, and 3 box types:
type│ min │ max │penalty
────┼─────┼─────┼───────
A   │  11 │  20 │   1
B   │  21 │  30 │   1
C   │  31 │  40 │   5
The possible combinations are:
boxTypes│ 31 │ 14 │ 13 │ 12 │ 7  │ penalty
────────┼────┼────┼────┼────┼────┼─────────
ABC     │ C  │ B  │ B  │ A  │ C  │ 7
BBC     │ C  │ B1 │ B2 │ B2 │ B1 │ 7
CC      │ C1 │ C2 │ C2 │ C2 │ C1 │ 10
ACC     │ C1 │ C2 │ C2 │ A  │ C2 │ 11
and many other unlisted possibilities where the set of box types are the same or the penalty is just too high.
Notice that there are 2 solutions with the same penalty. However, considering the third condition:
boxTypes │ box1 │ box2 │ box3 │ shortfall
─────────┼──────┼──────┼──────┼──────────────────────────────────
ABC      │ 12   │ 27   │ 38   │ (20-12) + (30-27) + (40-38) = 13
BBC      │ 21   │ 25   │ 31   │ (30-25) + (30-25) + (40-31) = 19
The ABC box combination is chosen due to filling the most capacity of the boxes.
My code
I am currently recursively generating all combinations of the balls, and check whether there is a set of box that fits the ball groups. I am able to improve the performance by:
Early halt when a group weight is out of the maximum capacity (40 in this example)
Limit the number of boxes (2 - 3 instead of 5, i.e. 1 box for each ball)
However, my solution still cannot handle more than 15 balls. Is there a better algorithm other than bruteforce to solve this problem?
Clickhouse pass data for inner join
In PostgreSQL we can join tables with custom data. For example:
select * from points p inner join (VALUES (5, '2000-1-1'::date, 1, 1)) as x(id, create_date, store_id, supplier_id) on p.id = x.id
Does this kind of join exist in ClickHouse? If yes, how should I write it?
https://github.com/ClickHouse/ClickHouse/issues/5984
SELECT * FROM VALUES('a UInt64, s String', (1, 'one'), (2, 'two'), (3, 'three'))
┌─a─┬─s─────┐
│ 1 │ one   │
│ 2 │ two   │
│ 3 │ three │
└───┴───────┘
WITH
    [(toUInt64(1), 'one'), (2, 'two'), (3, 'three')] AS rows_array,
    arrayJoin(rows_array) AS row_tuple
SELECT row_tuple.1 AS number_decimal, row_tuple.2 AS number_string
┌─number_decimal─┬─number_string─┐
│              1 │ one           │
│              2 │ two           │
│              3 │ three         │
└────────────────┴───────────────┘
Clickhouse: runningAccumulate() does not work as I expect
Say, we have a table testint.
SELECT * FROM testint
┌─f1─┬─f2─┐
│  2 │  3 │
│  2 │  3 │
│  4 │  5 │
│  4 │  5 │
│  6 │  7 │
│  6 │  7 │
└────┴────┘
We try to query runningAccumulate() with sumState().
SELECT runningAccumulate(col)
FROM
(
    SELECT sumState(f1) AS col
    FROM testint
    GROUP BY f1
)
┌─runningAccumulate(col)─┐
│                      8 │
│                     12 │
│                     24 │
└────────────────────────┘
Why is the first row in the response 8, and not 4? If we are grouping by f1, the first row seems to be 4 (we do sum the first 2 and the second 2 in the column f1).
For accumulate functions the order of elements matters, so just add ORDER BY to fix it:
SELECT runningAccumulate(col)
FROM
(
    SELECT sumState(f1) AS col
    FROM testint
    GROUP BY f1
    ORDER BY f1 ASC /* <-- */
)
You got the result [8, 12, 24] because the group sums arrived in the order [8, 4, 12], whereas the ordered input [4, 8, 12] should have been used.
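The effect of input order on a running sum is easy to reproduce in Python. This is only an analogy for what runningAccumulate does with the group states; the helper name running_accumulate is chosen for this sketch.

```python
from itertools import accumulate

def running_accumulate(group_sums):
    """Running total over the group sums, in the order they are given."""
    return list(accumulate(group_sums))

unordered = [8, 4, 12]  # group sums in the order the groups happened to arrive
print(running_accumulate(unordered))          # [8, 12, 24]
print(running_accumulate(sorted(unordered)))  # [4, 12, 24]
```

The last element is the same in both cases, since the grand total does not depend on order; only the intermediate values change.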
How to use results of [with totals] modifier
The [with totals] modifier summarizes values across all rows and returns the total as an extra result with key value 0, null, or something like that. The problem is that I don't understand how I can use these values in subsequent calculations. Maybe I'm using the wrong format.
select processing_date,count(*) from `telegram.message` where processing_date>='2019-05-01' group by processing_date with totals
The documentation says:
You can use WITH TOTALS in subqueries, including subqueries in the JOIN clause (in this case, the respective total values are combined).
Example with subqueries in the JOIN (from the ClickHouse test scripts on GitHub):
SELECT k, s1, s2
FROM
(
    SELECT intDiv(number, 3) AS k, sum(number) AS s1
    FROM
    (
        SELECT * FROM system.numbers LIMIT 10
    )
    GROUP BY k WITH TOTALS
)
ANY LEFT JOIN
(
    SELECT intDiv(number, 4) AS k, sum(number) AS s2
    FROM
    (
        SELECT * FROM system.numbers LIMIT 10
    )
    GROUP BY k WITH TOTALS
)
USING (k)
ORDER BY k ASC
/* Result:
┌─k─┬─s1─┬─s2─┐
│ 0 │  3 │  6 │
│ 1 │ 12 │ 22 │
│ 2 │ 21 │ 17 │
│ 3 │  9 │  0 │
└───┴────┴────┘
Totals:
┌─k─┬─s1─┬─s2─┐
│ 0 │ 45 │ 45 │
└───┴────┴────┘
*/
As a workaround, you can combine the results of several totals using client libraries.
Using "with rollup" instead of "with totals" solves the problems with the format.
Algorithm to find best dimensions combination
I am looking for an algorithm to find the best dimension combination to accomplish a desired result. Take the following as an example:
| A      | B      | C     | y   |
|--------|--------|-------|-----|
| dog    | house1 | green | 30  |
| dog    | house1 | blue  | 15  |
| cat    | house1 | green | 20  |
| cat    | house2 | red   | 5   |
| turtle | house3 | green | 50  |
A, B, C are the measured dimensions. y is the measured result.
If I want to get all combinations of dimensions that accomplish y >= 50, the results will be:
turtle, house3, green
turtle, any, green
turtle, house3, any
turtle, any, any
any, house3, green
any, house3, any
any, any, green
any, house1, green
any, house1, any
Maybe it's an easy problem, but I was trying to figure out an optimal solution in terms of O(n) and I didn't find one.
Start with a work queue containing (any, any, ..., any), 0. The elements of this queue will be pairs consisting of a combination and a number of elements on the left that cannot be changed from any (this will make more sense shortly).
Until the work queue is empty, remove one element from it and compute the corresponding sum. If it doesn't meet the threshold, then discard it. Otherwise, report it as one of the sought combinations. For each any that can be changed, for each value in that column, enqueue a combination consisting of the current one with any replaced by that value, with the index locking down all previous any values.
Considering an output-sensitive bound, this is within a polynomial factor of optimal (in general, there can be exponentially many combinations).
In Python 3:
def overthreshold(data, threshold):
    queue = [(('any',) * len(data[0][0]), 0)]
    for combination, begin in queue:
        if sum(row[1] for row in data if all(x in {'any', y} for x, y in zip(combination, row[0]))) < threshold:
            continue
        yield combination
        for i in range(begin, len(combination)):
            if combination[i] == 'any':
                queue.extend((combination[:i] + (x,) + combination[i+1:], i + 1) for x in {row[0][i] for row in data})

def demo():
    data = [
        (('dog', 'house1', 'green'), 30),
        (('dog', 'house1', 'blue'), 15),
        (('cat', 'house1', 'green'), 20),
        (('cat', 'house2', 'red'), 5),
        (('turtle', 'house3', 'green'), 50),
    ]
    for combination in overthreshold(data, 50):
        print(combination)
Back here, 8 years later to answer the question using ClickHouse:
WITH table AS
(
    SELECT 'dog' AS a, 'house1' AS b, 'green' AS c, 30 AS y
    UNION ALL
    SELECT 'dog' AS a, 'house1' AS b, 'blue' AS c, 15 AS y
    UNION ALL
    SELECT 'cat' AS a, 'house1' AS b, 'green' AS c, 20 AS y
    UNION ALL
    SELECT 'cat' AS a, 'house2' AS b, 'red' AS c, 5 AS y
    UNION ALL
    SELECT 'turtle' AS a, 'house3' AS b, 'green' AS c, 50 AS y
)
SELECT a, b, c, sum(y) y
FROM table
GROUP BY CUBE(a, b, c)
HAVING y >= 50
FORMAT PrettyCompactMonoBlock;
┌─a──────┬─b──────┬─c─────┬───y─┐
│ turtle │ house3 │ green │  50 │
│ turtle │ house3 │       │  50 │
│ turtle │        │ green │  50 │
│ turtle │        │       │  50 │
│        │ house3 │ green │  50 │
│        │ house1 │ green │  50 │
│        │ house3 │       │  50 │
│        │ house1 │       │  65 │
│        │        │ green │ 100 │
│        │        │       │ 120 │
└────────┴────────┴───────┴─────┘
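To see what the CUBE grouping computes, the following Python sketch sums y for every subset of the three dimensions and keeps the groups that reach the threshold. It is a rough model rather than a description of ClickHouse internals: the name cube_over_threshold is made up, and 'any' stands in for the blank cells of the table above.

```python
from collections import defaultdict
from itertools import combinations

ROWS = [
    (("dog", "house1", "green"), 30),
    (("dog", "house1", "blue"), 15),
    (("cat", "house1", "green"), 20),
    (("cat", "house2", "red"), 5),
    (("turtle", "house3", "green"), 50),
]

def cube_over_threshold(rows, threshold):
    """Sum y for every subset of dimensions and keep groups with sum >= threshold."""
    n_dims = len(rows[0][0])
    results = {}
    for size in range(n_dims + 1):
        for kept in combinations(range(n_dims), size):
            sums = defaultdict(int)
            for dims, y in rows:
                # Dimensions outside the current grouping set collapse to 'any'
                key = tuple(dims[i] if i in kept else "any" for i in range(n_dims))
                sums[key] += y
            results.update({k: v for k, v in sums.items() if v >= threshold})
    return results

for combo, total in cube_over_threshold(ROWS, 50).items():
    print(combo, total)
```

With threshold 50 this yields 10 groups, including the all-'any' grand total of 120, which the list in the question leaves out.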