Why does maxMerge() give different results than max() of sumMerge()? - clickhouse

The whole point is to get peaks per period (e.g. 5-minute peaks) for a value that accumulates. The value therefore has to be summed per period first, and then the peak (maximum) is found among those sums: (select max(v) from (select sum(v) as v from t group by a1, a2))
I have a base table t.
Data is inserted into t; consider two attributes (a time t1 and some string a2) and one numeric value v.
The value accumulates, so it needs to be summed to get the total volume over a certain period. Example of inserted rows:
t1    | a2 | v
----------------
date1 | b  | 1
date2 | c  | 20
I'm using an MV to compute sumState(), and from that I get the peaks using sumMerge() and then max().
I only need the max values, so I was wondering whether I could use maxState() directly.
This is what I do now: I use an MV that computes a 5-minute sum, and from that I read max():
CREATE TABLE IF NOT EXISTS sums_table ON CLUSTER '{cluster}' (
    t1 DateTime,
    a2 String,
    v AggregateFunction(sum, UInt32)
)
ENGINE = ReplicatedAggregatingMergeTree(
    '...',
    '{replica}'
)
PARTITION BY toDate(t1)
ORDER BY (a2, t1)
PRIMARY KEY (a2);
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_a
ON CLUSTER '{cluster}'
TO sums_table
AS
SELECT
    toStartOfFiveMinute(t1) AS t1,
    a2,
    sumState(toUInt32(v)) AS v
FROM t
GROUP BY t1, a2;
From that I'm able to read the max of the 5-minute sums per a2 using:
SELECT
    a2,
    max(sum) AS max
FROM (
    SELECT
        t1,
        a2,
        sumMerge(v) AS sum
    FROM sums_table
    WHERE t1 BETWEEN :fromDateTime AND :toDateTime
    GROUP BY t1, a2
)
GROUP BY a2
ORDER BY max DESC
That works perfectly.
So I wanted to achieve the same using maxState and maxMerge():
CREATE TABLE IF NOT EXISTS max_table ON CLUSTER '{cluster}' (
    t1 DateTime,
    a2 String,
    max_v AggregateFunction(max, UInt32)
)
ENGINE = ReplicatedAggregatingMergeTree(
    '...',
    '{replica}'
)
PARTITION BY toDate(t1)
ORDER BY (a2, t1)
PRIMARY KEY (a2);
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_b
ON CLUSTER '{cluster}'
TO max_table
AS
SELECT
    t1,
    a2,
    maxState(v) AS max_v
FROM (
    SELECT
        toStartOfFiveMinute(t1) AS t1,
        a2,
        toUInt32(sum(v)) AS v
    FROM t
    GROUP BY t1, a2
)
GROUP BY t1, a2
I thought that if I get a max per time (t1) and a2, and then select the max of that per a2, I'd get the maximum value for each a2. But with this query I'm getting totally different max values compared to the max of sums mentioned above.
SELECT
    a2,
    max(max) AS max
FROM (
    SELECT
        t1,
        a2,
        maxMerge(max_v) AS max
    FROM max_table
    WHERE t1 BETWEEN :fromDateTime AND :toDateTime
    GROUP BY t1, a2
) maxs_per_time_and_a2
GROUP BY a2
What did I do wrong? Am I misunderstanding MVs? Is it possible to use maxState with maxMerge for 2+ attributes to compute the max over a longer period, say a year?

SELECT
    t1,
    a2,
    maxState(v) AS max_v
FROM (
    SELECT
        toStartOfFiveMinute(t1) AS t1,
        a2,
        toUInt32(sum(v)) AS v
    FROM t
    GROUP BY t1, a2
)
GROUP BY t1, a2
This is incorrect, and it cannot work.
An MV is an insert trigger. It never reads the real table t.
You are getting the max of the sum of only the rows in the inserted block.
If you insert one row with v=10, you will get max_v = 10. The materialized view does not "know" that a previous insert has added some rows, so their sum is not taken into account.
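The difference can be illustrated outside ClickHouse with a small Python simulation. This is only a sketch under stated assumptions: each inner list stands for one INSERT block, timestamps are plain seconds, and the function names and sample values are made up for illustration. It mimics the idea that a view sees one inserted block at a time, not ClickHouse's actual internals.

```python
from collections import defaultdict

def bucket(ts, width=300):
    """Round a timestamp (in seconds) down to the start of its 5-minute bucket."""
    return ts - ts % width

def max_of_block_sums(insert_blocks):
    """Mimic mv_b: sum per (bucket, a2) inside each inserted block only,
    then keep the largest of those per-block sums."""
    state = {}
    for block in insert_blocks:
        block_sums = defaultdict(int)
        for ts, a2, v in block:
            block_sums[(bucket(ts), a2)] += v
        for key, s in block_sums.items():
            state[key] = max(state.get(key, s), s)
    return state

def sum_then_max(insert_blocks):
    """Mimic mv_a plus the read query: partial sums add up across blocks,
    and max() is applied only when reading."""
    sums = defaultdict(int)
    for block in insert_blocks:
        for ts, a2, v in block:
            sums[(bucket(ts), a2)] += v
    peaks = {}
    for (_, a2), s in sums.items():
        peaks[a2] = max(peaks.get(a2, s), s)
    return peaks

# Hypothetical data: three separate inserts, the first two in the same bucket.
blocks = [
    [(0, "b", 10)],
    [(60, "b", 5), (120, "b", 7)],
    [(400, "b", 3)],
]

peaks_from_sums = sum_then_max(blocks)          # {'b': 22}
peaks_from_max = {}
for (_, a2), m in max_of_block_sums(blocks).items():
    peaks_from_max[a2] = max(peaks_from_max.get(a2, m), m)  # {'b': 12}
```

The first bucket really holds 10 + 5 + 7 = 22, but the max-based state only ever sees the per-insert sums 10 and 12, so it reports 12.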

Related

SQL UNION Optimization

I have 4 tables named A1, A2, B1, B2.
To fulfill a requirement, I have two ways to write SQL queries. The first one is:
(A1 UNION ALL A2) A JOIN (B1 UNION ALL B2) B ON A.id = B.a_id WHERE ...
And the second one is:
(A1 JOIN B1 on A1.id = B1.a_id WHERE ...) UNION ALL (A2 JOIN B2 on A2.id = B2.a_id WHERE ... )
I tried both approaches and realized they both give the same execution time and query plans in some specific cases. But I'm unsure whether they will always give the same performance or not.
So my question is: when is the first or the second one better in terms of performance?
In terms of coding, I prefer the first one because I can create two views on (A1 UNION ALL A2) as well as (B1 UNION ALL B2) and treat them like two tables.
The second one is better:
(A1 JOIN B1 on A1.id = B1.a_id WHERE ...) UNION ALL (A2 JOIN B2 on A2.id = B2.a_id WHERE ... )
It gives the Oracle cost-based optimizer (CBO) more information about how your tables are related to each other. The CBO can then calculate the costs of potential plans more precisely. It's all about cardinality, column statistics, etc.
Purely functionally, and without knowing what's in the tables, the first seems better: if data matches between A1 and B2, your second query won't join it.
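The functional difference between the two forms can be checked with a small Python sketch. The rows below are hypothetical, and the `join` helper is a naive nested-loop inner join written only for illustration; it says nothing about how Oracle executes either query.

```python
def join(a_rows, b_rows):
    """Naive inner join on a.id = b.a_id; rows are plain dicts."""
    return [(a["id"], b["name"])
            for a in a_rows
            for b in b_rows
            if a["id"] == b["a_id"]]

# Hypothetical contents: one row in B2 refers to an id that lives in A1.
A1 = [{"id": 1}]
A2 = [{"id": 2}]
B1 = [{"a_id": 1, "name": "x"}, {"a_id": 2, "name": "y"}]
B2 = [{"a_id": 1, "name": "z"}]

# (A1 UNION ALL A2) JOIN (B1 UNION ALL B2)
form1 = join(A1 + A2, B1 + B2)

# (A1 JOIN B1) UNION ALL (A2 JOIN B2)
form2 = join(A1, B1) + join(A2, B2)

print(sorted(form1))  # [(1, 'x'), (1, 'z'), (2, 'y')]
print(sorted(form2))  # [(1, 'x')]
```

The two queries return the same rows only when matches never cross between the "1" tables and the "2" tables, so the performance comparison is meaningful only under that assumption.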

Tableau sorting by value from measure

I have a measure in Tableau that has values a, b, c and a field of countries, each of which has varying values of a, b, c.
I want to make a sort that lets you select a, b, or c and then ranks the countries by their value of a, b, or c.
I have tried using a parameter with a calculated field, but that doesn't work because that method is for sorting between measures, not between values of a measure.
This can be accomplished with a parameter and calculated field--what have you tried that didn't work?
[sort parameter]
A string value with allowable values of a, b, or c
If a, b, and c are each their own measure:
[sort field]
IF [sort parameter] = 'a' then
[a]
ELSEIF [sort parameter] = 'b' then
[b]
ELSE
[c]
END
If values of a, b, and c are in one measure (perhaps called [value]) and a separate column exists to distinguish (perhaps called [type]):
[sort field]
IF [sort parameter] = 'a' then
SUM(IIF([type] = 'a',[value],NULL))
ELSEIF [sort parameter] = 'b' then
SUM(IIF([type] = 'b',[value],NULL))
ELSE
SUM(IIF([type] = 'c',[value],NULL))
END
Sort your visualization by [sort field].
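The logic of the second calculated field (one [value] column plus a [type] column) can be sketched in Python to show what the sort ends up doing. The country names and numbers are invented for illustration, and `rank_countries` is a hypothetical helper, not a Tableau feature.

```python
# Hypothetical rows: (country, type, value)
rows = [
    ("France", "a", 3), ("France", "b", 9),
    ("Japan", "a", 7), ("Japan", "b", 1),
    ("Peru", "a", 5), ("Peru", "b", 4),
]

def rank_countries(rows, sort_parameter):
    """Sum [value] per country for the selected [type] only,
    then order the countries by that sum, largest first."""
    totals = {}
    for country, type_, value in rows:
        if type_ == sort_parameter:
            totals[country] = totals.get(country, 0) + value
    return sorted(totals, key=totals.get, reverse=True)

print(rank_countries(rows, "a"))  # ['Japan', 'Peru', 'France']
print(rank_countries(rows, "b"))  # ['France', 'Peru', 'Japan']
```

In this sketch, countries without any row of the selected type are simply left out of the ranking.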

Table of all combinations possible with combination id

Here comes a little ABAP challenge:
For an ABAP project, I must build, from an internal table with 2 columns (example1), another table containing all possible combinations (example2).
The "X" column represents the parameter. The "Y" column represents the parameter value.
example1:
X(param)  Y(value)
A         a1
A         a2
A         a3
B         b1
B         b2
C         c1
C         c2
In the result table (example2):
We must get all combinations with a numeric id (in 3 columns).
The new "z" column represents the combination id. For each combination there is a number of lines equal to the number of distinct parameters (in our case 3 lines, for A, B and C).
The "x" column still represents the parameter and the "y" column the associated value.
example2:
z(combi num)  x(param)  y(value)
1             A         a1
1             B         b1
1             C         c1
2             A         a1
2             B         b1
2             C         c2
3             A         a1
3             B         b2
3             C         c1
4             A         a1
4             B         b2
4             C         c2
etc...        etc...    etc...
12            A         a3
12            B         b2
12            C         c2
Another remark: neither the number of parameters nor the number of values per parameter is fixed (the initial internal table can change a lot, and so can the possible combinations).
We may need recursion, but I'm not sure.
Here is a non-recursive way to do it; you might have to rewrite the parts that use the newer 7.40 syntax. The idea is pretty simple: first transform the data into an internal table with one entry per parameter, each containing a table of its possible values (the LOOP loop). From there it is a simple matter of going through all the combinations and adding them to another internal table (the WHILE loop).
REPORT z_algorithm.
TYPES: ty_param TYPE char1,
       ty_value TYPE char2,
       BEGIN OF ty_struct,
         x TYPE ty_param,
         y TYPE ty_value,
       END OF ty_struct,
       BEGIN OF ty_combi,
         z TYPE i,
         s TYPE ty_struct,
       END OF ty_combi.
TYPES: BEGIN OF ty_param_struct,
         x  TYPE ty_param,
         ys TYPE STANDARD TABLE OF ty_value WITH DEFAULT KEY,
         ix TYPE i,
       END OF ty_param_struct.
DATA: tab      TYPE STANDARD TABLE OF ty_struct,
      params   TYPE STANDARD TABLE OF ty_param_struct,
      done     TYPE abap_bool VALUE abap_false,
      z        TYPE i VALUE 0,
      overflow TYPE abap_bool VALUE abap_false,
      combis   TYPE STANDARD TABLE OF ty_combi.
START-OF-SELECTION.
  APPEND VALUE: #( x = 'A' y = 'a1' ) TO tab,
                #( x = 'A' y = 'a2' ) TO tab,
                #( x = 'A' y = 'a3' ) TO tab,
                #( x = 'B' y = 'b1' ) TO tab,
                #( x = 'B' y = 'b2' ) TO tab,
                #( x = 'C' y = 'c1' ) TO tab,
                #( x = 'C' y = 'c2' ) TO tab.
  LOOP AT tab ASSIGNING FIELD-SYMBOL(<tab>).
    READ TABLE params WITH KEY x = <tab>-x ASSIGNING FIELD-SYMBOL(<param>).
    IF sy-subrc NE 0.
      APPEND INITIAL LINE TO params ASSIGNING <param>.
      <param>-x = <tab>-x.
      <param>-ix = 1.
    ENDIF.
    APPEND <tab>-y TO <param>-ys.
  ENDLOOP.
  WHILE done EQ abap_false.
    ADD 1 TO z.
    overflow = abap_true.
    done = abap_true.
    LOOP AT params ASSIGNING <param>.
      READ TABLE <param>-ys INDEX <param>-ix ASSIGNING FIELD-SYMBOL(<y>).
      APPEND VALUE #( z = z s-x = <param>-x s-y = <y> ) TO combis.
      IF overflow EQ abap_true.
        ADD 1 TO <param>-ix.
      ENDIF.
      IF <param>-ix GT lines( <param>-ys ).
        overflow = abap_true.
        <param>-ix = 1.
      ELSE.
        overflow = abap_false.
        done = abap_false.
      ENDIF.
    ENDLOOP.
  ENDWHILE.
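For readers who do not work in ABAP, the same non-recursive "odometer" idea can be sketched in Python. This is an illustrative transcription, not the answer's code: `all_combinations` and its variable names are made up here, and, like the ABAP above, it advances the first parameter fastest, so the combination ids are assigned in a different order than in example2.

```python
def all_combinations(tab):
    """tab is a list of (param, value) pairs.
    Returns rows (z, param, value), one row per parameter per combination."""
    # Step 1: one entry per parameter with the list of its values.
    params = {}
    for x, y in tab:
        params.setdefault(x, []).append(y)
    names = list(params)
    idx = [0] * len(names)      # current value index per parameter

    combis = []
    z = 0
    done = False
    while not done:
        z += 1
        for name, i in zip(names, idx):
            combis.append((z, name, params[name][i]))
        # Step 2: advance like an odometer, carrying into the next parameter.
        pos = 0
        while pos < len(names):
            idx[pos] += 1
            if idx[pos] < len(params[names[pos]]):
                break           # no carry needed
            idx[pos] = 0        # wrap around and carry
            pos += 1
        else:
            done = True         # every position wrapped: all combinations seen
    return combis

tab = [("A", "a1"), ("A", "a2"), ("A", "a3"),
       ("B", "b1"), ("B", "b2"),
       ("C", "c1"), ("C", "c2")]
combis = all_combinations(tab)
print(len(combis))   # 36 rows = 12 combinations x 3 parameters
```

Because the indices are kept in a plain list, the sketch works for any number of parameters and any number of values per parameter.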

Relational Algebra: Select tuples based on whether an attribute is unique in a table

Given a table:
T = {A1, A2, A3, A4}
How do you write a relational algebra statement that picks all tuples that have the same value for A3 as another tuple in the table?
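You do an equijoin of T with itself on column A3.
T2 ← T,  T ⋈_{T.A3 = T2.A3} T2
Now any tuple from T will be connected with all tuples that have the same value for A3. You can further select for a specific value of A3 from T and project to the attributes from T2. Note that every tuple also matches itself in this join, so to require a different tuple you need an additional selection stating that the two tuples differ in at least one other attribute.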
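A small Python sketch of the self-join on A3 may make the idea concrete. The sample tuples are invented, and the extra condition that the partner tuple must be a different tuple is an assumption added here, since a plain self-join also pairs every tuple with itself.

```python
# Hypothetical relation T with attributes (A1, A2, A3, A4).
T = [
    (1, "p", "red", 10),
    (2, "q", "blue", 20),
    (3, "r", "red", 30),
    (4, "s", "green", 40),
]

def shares_a3_with_another(relation):
    """Keep the tuples whose A3 value (index 2) also occurs in a different tuple.
    Equivalent to joining the relation with a copy of itself on A3,
    discarding self-matches, and projecting back to one side."""
    return [t for t in relation
            if any(t2 != t and t2[2] == t[2] for t2 in relation)]

print(shares_a3_with_another(T))
# [(1, 'p', 'red', 10), (3, 'r', 'red', 30)]
```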

Algorithms to create a tabular representation of a DAG?

Given a DAG, in which each node belongs to a category, how can this graph be transformed into a table with a column for each category? The transformation doesn't have to be reversible, but should preserve useful information about the structure of the graph; and should be a 'natural' transformation, in the sense that a person looking at the graph and the table should not be surprised by any of the rows. It should also be compact, i.e. have few rows.
For example given a graph of nodes a1,b1,b2,c1 with edges a1->b1, a1->b2, b1->c1, b2->c1 (i.e. a diamond-shaped graph) I would expect to see the following table:
a b c
--------
a1 b1 c1
a1 b2 c1
I've thought about this problem quite a bit, but I'm having trouble coming up with an algorithm that gives intuitive results on certain graphs. Consider the graph a1,b1,c1 with edges a1->c1, b1->c1. I'd like the algorithm to produce this table:
a b c
--------
a1 b1 c1
But maybe it should produce this instead:
a  b  c
--------
a1    c1
a1 b1
I'm looking for creative ideas and insights into the problem. Feel free to vary, simplify, or constrain the problem if you think it will help.
Brainstorm away!
Edit:
The transformation should always produce the same set of rows, although the order of rows does not matter.
The table should behave nicely when sorting and filtering using, e.g., Excel. This means that multiple nodes cannot be packed into a single cell of the table: only one node per cell.
What you need is a variation of topological sorting. This is an algorithm that "sorts" graph vertices as if an a---->b edge meant a > b. Since the graph is a DAG, there are no cycles in it and this > relation is transitive, so at least one sorting order exists.
For your diamond-shaped graph two topological orders exist:
a1 b1 b2 c1
a1 b2 b1 c1
b1 and b2 items are not connected, even indirectly, therefore, they may be placed in any order.
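As a reference point, here is a minimal topological sort in Python using Kahn's algorithm, one common way to compute such an order (the answer does not say which algorithm it has in mind). The function name is illustrative; the nodes and edges are the diamond-shaped graph from the question.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return the nodes in an order where every edge goes from an
    earlier node to a later one (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:   # all predecessors already emitted
                ready.append(m)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

nodes = ["a1", "b1", "b2", "c1"]
edges = [("a1", "b1"), ("a1", "b2"), ("b1", "c1"), ("b2", "c1")]
print(topological_order(nodes, edges))  # ['a1', 'b1', 'b2', 'c1']
```

Which of the valid orders comes out depends on the order in which nodes and edges are listed; listing the edge to b2 first would produce the other order.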
After you have sorted the graph, you know an approximate order. My proposal is to fill the table in a straightforward way (one vertex per row) and then "compact" the table. Perform the sort and take the sequence it outputs. Fill the table from top to bottom, assigning each vertex to its relevant column:
a  b  c
--------
a1
   b2
   b1
      c1
Now compact the table by walking from top to bottom (and then make a similar pass from bottom to top). On each iteration, you take a closer look at the "current" row (marked as =>) and at the "next" row.
If the nodes in the current and next rows differ in a column, do nothing for this column:
from ----> to
   X  b  c          X  b  c
   --------         --------
=> X1 .  .          X1 .  .
   X2 .  .       => X2 .  .
If in a column X there is no vertex in the next row (the table cell is empty) and there is a vertex X1 in the current row, then you should sometimes fill this empty cell with the vertex from the current row. But not always: you want your table to be logical, don't you? So copy the vertex if and only if there is no edge b--->X1, c--->X1, etc., for any of the vertices in the current row.
from ---> to
   X  b  c          X  b  c
   --------         --------
=> X1 b  c          X1 b  c
      b1 c1      => X1 b1 c1
(Edit:) After the first (forward) and second (backward) passes, you'll have the following tables:
first        second
a  b  c      a  b  c
--------     --------
a1           a1 b2 c1
a1 b2        a1 b2 c1
a1 b1        a1 b1 c1
a1 b1 c1     a1 b1 c1
Then, just remove equal rows and you're done:
a b c
--------
a1 b2 c1
a1 b1 c1
And you should get a nice table. O(n^2).
How about compacting all nodes reachable from one node together in one cell? For example, your first DAG should look like:
a    b        c
---------------
a1   [b1,b2]
     b1       c1
     b2       c1
It sounds like a train system map with stations within zones (a,b,c).
You could be generating a table of all possible routes in one direction, in which case "a1, b1, c1" would seem to imply a1->b1, so don't format it like that if you have only a1->c1, b1->c1.
You could decide to produce a table by listing the longest routes starting in zone a, using each edge only once, ending with the short leftover routes. Or allow edges to be reused only if they connect unused edges or extend a route.
In other words, do a depth first search, trying not to reuse edges (reject any path that doesn't include unused edges, and optionally trim used edges at the endpoints).
Here's what I ended up doing:
Find all paths emanating from a node without in-edges. (Could be expensive for some graphs, but works for mine)
Traverse each path to collect a row of values
Compact the rows
Compacting the rows is done as follows.
For each pair of columns x,y:
Construct a map of every value of x to its possible values of y.
Create another map for entries that have only one distinct value of y, mapping the value of x to its single value of y.
Fill in the blanks using these maps. When filling in a value, check for related blanks that can be filled.
This gives a very compact output and seems to meet all my requirements.
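The first two steps (find all paths from nodes without in-edges, then turn each path into a row) can be sketched in Python as below. The compaction step is deliberately left out, since its details are only outlined above. The function name is hypothetical, and deriving a node's category from the first letter of its name is an assumption made for this example.

```python
def rows_from_paths(nodes, edges, category):
    """Enumerate every path from a root (no in-edges) to a leaf (no out-edges)
    and emit one table row per path, with one column per category."""
    successors = {n: [] for n in nodes}
    has_in_edge = set()
    for src, dst in edges:
        successors[src].append(dst)
        has_in_edge.add(dst)

    columns = sorted({category(n) for n in nodes})
    rows = []

    def walk(node, path):
        path = path + [node]
        if not successors[node]:            # reached a leaf: emit a row
            cells = {c: "" for c in columns}
            for n in path:
                cells[category(n)] = n
            rows.append(tuple(cells[c] for c in columns))
            return
        for nxt in successors[node]:
            walk(nxt, path)

    for root in nodes:
        if root not in has_in_edge:
            walk(root, [])
    return columns, rows

# Assumption: the category is the first letter of the node name.
first_letter = lambda n: n[0]

diamond = rows_from_paths(
    ["a1", "b1", "b2", "c1"],
    [("a1", "b1"), ("a1", "b2"), ("b1", "c1"), ("b2", "c1")],
    first_letter,
)
print(diamond)  # (['a', 'b', 'c'], [('a1', 'b1', 'c1'), ('a1', 'b2', 'c1')])

fork = rows_from_paths(
    ["a1", "b1", "c1"],
    [("a1", "c1"), ("b1", "c1")],
    first_letter,
)
print(fork)     # (['a', 'b', 'c'], [('a1', '', 'c1'), ('', 'b1', 'c1')])
```

For the second graph the rows still contain blanks; filling those in is what the compaction step described above is for.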
