My data has columns like starttime, endtime, id, a, b, c, d, e, f, g...
How should I create an index in ClickHouse? Most of my queries look like the following:
1. select starttime,endtime,id,a,b,c,d,e,f,g from tbl1
where starttime>=? and endtime<=? and id=?
2. select a,c,sum(f),avg(g) from tbl1
where starttime>=? and endtime<=?
group by a,c
order by sum(f) desc
limit 20
3. select starttime,endtime,id,a,b,c,d,e,f,g from tbl1
where starttime>=? and endtime<=? and a=?
limit 20
4. select a,c,sum(f),avg(g) from tbl1
where starttime>=? and endtime<=? and a=? and c=?
group by a,c
order by sum(f) desc
limit 20
5. select a,b,c,d,e,f,g from tbl1
where starttime>=? and endtime<=? and a=? and d=? and e=?
order by a,d,e
limit 20
Notes:
a) Every query filters on starttime and endtime.
b) Some queries also filter on id and return a small amount of data (query 1), while the others scan a large amount of data (queries 2, 3, 4 and 5).
ClickHouse is a column-oriented database with a single primary key. It stores each column separately in its own optimized storage, so the usual secondary indexes are not needed.
You can choose a *MergeTree table engine with starttime and endtime in the primary key.
Every kind of MergeTree engine is described here: https://clickhouse.yandex/docs/en/operations/table_engines/mergetree/
Because all of your queries filter on starttime and endtime, they can then use the primary key to skip data outside the requested time range.
I have two tables in Oracle and I have to synchronize values (Field column) between the tables. I'm using Informatica PowerCenter for this synchronization operation. The source qualifier query causes high I/O usage and I need to solve it.
Table1
Table1 has about 20M rows. Field in Table1 holds the actual (authoritative) value. The Timestamp column holds the create/update date, and the table is partitioned daily.
Id  Field  Timestamp
1   A      2017-05-12 03:13:40
2   B      2002-11-01 07:30:46
3   C      2008-03-03 03:26:29
Table2
Table2 has about 500M rows. Field in Table2 should be kept as closely in sync as possible with Field in Table1. The Timestamp column holds the create/update date, and the table is partitioned daily. Table2 is also the target in the mapping.
Id   Table1_Id  Field  Timestamp            Action
100  1          A      2005-09-30 03:20:41  Nothing
101  1          B      2015-06-29 09:41:44  Update Field as A
102  1          C      2016-01-10 23:35:49  Update Field as A
103  2          A      2019-05-08 07:42:46  Update Field as B
104  2          B      2003-06-02 11:23:57  Nothing
105  2          C      2021-09-21 12:04:24  Update Field as B
106  3          A      2022-01-23 01:17:18  Update Field as C
107  3          B      2008-04-24 15:17:25  Update Field as C
108  3          C      2010-01-15 07:20:13  Nothing
Mapping Queries
Source Qualifier Query
SELECT *
FROM Table1 t1, Table2 t2
WHERE t1.Id = t2.Table1_Id AND t1.Field <> t2.Field
Update Transformation Query
UPDATE Table2
SET
Field = :tu.Field,
Timestamp = SYSDATE
WHERE Id = :tu.Id
You can use the approach below.
SQ (Source Qualifier): your SQL is correct, and you can keep it if it works, but also add a clause on the partition date key column. You can use this SQL to speed it up as well:
SELECT *
FROM Table2 t2
INNER JOIN Table1 t3 ON t3.Id = t2.Table1_Id
LEFT OUTER JOIN Table1 t1
  ON t1.Id = t2.Table1_Id
  AND t1.Field = t2.Field
  -- You did not mention a partition_date column; I am assuming there is a
  -- separate column that is used for partitioning.
  AND t1.partition_date = t2.partition_date
WHERE t1.Id IS NULL -- anti-join; <> is inefficient.
Then, in your Informatica target definition for Table2, make sure partition_date is part of the key along with Id.
Then use an Update Strategy transformation set to DD_UPDATE. Alternatively, you can set the session to treat rows as updates.
Also remove the target update override. It applies the update query to the whole table, which can be inefficient and I/O intensive.
Informatica is good at updating data in batches through an Update Strategy. You can increase the commit interval to tune performance.
You shouldn't try to update a 500M-row table in a single pass using SQL. If you prefer SQL, you can use PL/SQL to update in batches.
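To make the "read the mismatches once, then update by key in batches" idea concrete, here is a minimal sketch in Python using an in-memory SQLite database as a stand-in for Oracle. It is not Informatica or PL/SQL code: the `BATCH_SIZE` value and the simplified table definitions (no Timestamp or partition column) are assumptions made only for illustration, and the rows are the sample data from the question.

```python
import sqlite3

# Hypothetical in-memory stand-in for the two Oracle tables (Timestamp omitted).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INTEGER PRIMARY KEY, Field TEXT);
    CREATE TABLE Table2 (Id INTEGER PRIMARY KEY, Table1_Id INTEGER, Field TEXT);
    INSERT INTO Table1 VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO Table2 VALUES
        (100, 1, 'A'), (101, 1, 'B'), (102, 1, 'C'),
        (103, 2, 'A'), (104, 2, 'B'), (105, 2, 'C'),
        (106, 3, 'A'), (107, 3, 'B'), (108, 3, 'C');
""")

# Read the out-of-sync rows once (the role the source qualifier plays).
mismatches = conn.execute("""
    SELECT t2.Id, t1.Field
    FROM Table2 t2
    JOIN Table1 t1 ON t1.Id = t2.Table1_Id
    WHERE t1.Field <> t2.Field
    ORDER BY t2.Id
""").fetchall()

BATCH_SIZE = 2  # hypothetical commit interval; tune it for the real workload

updated = 0
for start in range(0, len(mismatches), BATCH_SIZE):
    batch = mismatches[start:start + BATCH_SIZE]
    # Update by primary key only, one small transaction per batch.
    conn.executemany(
        "UPDATE Table2 SET Field = ? WHERE Id = ?",
        [(field, row_id) for row_id, field in batch],
    )
    conn.commit()
    updated += len(batch)

# Count the rows that are still out of sync after the batched update.
remaining = conn.execute("""
    SELECT COUNT(*)
    FROM Table2 t2
    JOIN Table1 t1 ON t1.Id = t2.Table1_Id
    WHERE t1.Field <> t2.Field
""").fetchone()[0]
print(updated, remaining)
```

Each batch touches only rows addressed by `Id`, so no single statement has to rewrite the whole target table.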
We have almost 1B records in a replicated merge tree table.
The primary key is a,b,c
Our App keeps writing into this table with every user action. (we accumulate almost a million records per hour)
We append (store) the latest timestamp (updated_at) for a given unique combination of (a,b)
The key requirement is to provide a roll-up against the latest timestamp for a given combination of a,b,c
Currently, we are processing the queries as
select a,b,c, sum(x), sum(y)...etc
from table_1
where (a,b,updated_at) in (select a,b,max(updated_at) from table_1 group by a,b)
and c in (...)
group by a,b,c
Clarification on the sub-query:
(select a,b,max(updated_at) from table_1 group by a,b)
^ This part is for illustration only. Our app writes the latest updated_at for every (a, b), so the clause shown above is really more like
(select a,b,updated_at from tab_1_summary)
[where tab_1_summary has the latest record for a given (a, b)]
Note: We have to keep the grouping criteria as-is.
The table is structured with partition (c) order by (a, b, updated_at)
The question is: is there a way to write a better query that returns results faster? We need to shave a few seconds off the overall processing time.
FYI: We experimented with a materialized view using ReplicatedReplacingMergeTree. But given the size of this table and the constant inserts, the FINAL clause doesn't perform well compared to the query above.
Thanks in advance!
As a test, try using a join instead of the tuple IN (tuples) condition:
select t.a, t.b, t.c, sum(x), sum(y)...etc
from table_1 AS t inner join tab_1_summary using (a, b, updated_at)
where c in (...)
group by t.a, t.b, t.c
Consider using AggregatingMergeTree to pre-calculate result metrics:
CREATE MATERIALIZED VIEW table_1_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(updated_at)
ORDER BY (updated_at, a, b, c)
AS SELECT
updated_at,
a,b,c,
sum(x) AS x, /* see the SimpleAggregateFunction data type: https://clickhouse.tech/docs/en/sql-reference/data-types/simpleaggregatefunction/ */
sum(y) AS y,
/* For non-simple functions, use the AggregateFunction data type: https://clickhouse.tech/docs/en/sql-reference/data-types/aggregatefunction/ */
/* etc. */
FROM table_1
GROUP BY updated_at, a, b, c;
Then query the view like this to get the result:
select a,b,c, sum(x), sum(y)...etc
from table_1_mv
where (updated_at,a,b) in (select updated_at,a,b from tab_1_summary)
and c in (...)
group by a,b,c
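Before benchmarking the join rewrite against the tuple IN form, it can help to confirm on a small data set that both return the same roll-up. The sketch below does that in Python with an in-memory SQLite database, not ClickHouse, so it says nothing about ClickHouse performance. The rows and the integer `updated_at` values are made up for illustration, and it assumes tab_1_summary holds exactly one row per (a, b).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (a INT, b INT, c TEXT, updated_at INT, x INT, y INT);
    CREATE TABLE tab_1_summary (a INT, b INT, updated_at INT);
    -- Hypothetical sample rows; updated_at is a plain integer here.
    INSERT INTO table_1 VALUES
        (1, 1, 'c1', 10, 5, 1),
        (1, 1, 'c1', 20, 7, 2),
        (1, 1, 'c2', 20, 3, 4),
        (2, 1, 'c1', 15, 1, 1),
        (2, 1, 'c1', 30, 2, 2),
        (2, 1, 'c1', 30, 4, 4);
    -- Latest updated_at per (a, b), as the app would maintain it.
    INSERT INTO tab_1_summary VALUES (1, 1, 20), (2, 1, 30);
""")

# Form 1: tuple IN (sub-query), as in the question.
tuple_in = conn.execute("""
    SELECT a, b, c, SUM(x), SUM(y)
    FROM table_1
    WHERE (a, b, updated_at) IN (SELECT a, b, updated_at FROM tab_1_summary)
      AND c IN ('c1', 'c2')
    GROUP BY a, b, c
    ORDER BY a, b, c
""").fetchall()

# Form 2: inner join against the summary table, as suggested in the answer.
joined = conn.execute("""
    SELECT t.a, t.b, t.c, SUM(t.x), SUM(t.y)
    FROM table_1 AS t
    INNER JOIN tab_1_summary AS s
        ON s.a = t.a AND s.b = t.b AND s.updated_at = t.updated_at
    WHERE t.c IN ('c1', 'c2')
    GROUP BY t.a, t.b, t.c
    ORDER BY t.a, t.b, t.c
""").fetchall()

print(tuple_in)
print(joined)
```

The two forms agree only while the summary table has no duplicate (a, b, updated_at) rows; a duplicate would double-count in the join but not in the IN form.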
This is a question about populating a new column in a Hive table. Since I think Hive does not allow updating a column of an existing table using subqueries, I wanted to ask what the best way to achieve the following update operation would be.
I have the following two example tables:
Table A:
KeyId ValId Val
W1 V1 10
W2 V2 20
Table B:
KeyId ValId Val
W1 V1 10
W1 V1 30
W1 V3 40
W1 V4 50
W2 V2 0
W2 V2 50
W2 V2 70
W2 V4 80
I want to create another column in Table A, let's say avgVal, that takes the KeyId and ValId of each row in Table A and holds the average of Val over the rows in Table B with the same KeyId and ValId. Thus, my final output table should look like:
Updated Table A:
KeyId ValId Val avgVal
W1 V1 10 20
W2 V2 20 40
Please let me know if the question is not clear.
It seems you are trying to get aggregate values in table A from table B. In that case you cannot have the "val" column in table A, because after aggregation, which val from table B would you expect in table A?
Assuming that was a genuine mistake and you remove the "val" column from table A, your insert statement for table A should look like this:
insert into table table_a select keyid,valid,avg(val) from table_b group by keyid,valid
You can use the query below to get the average of the data in Table_B corresponding to each row in table_A:
select t1.keyid, t1.valid , t1.val , avgval from table_A t1 left join
(select keyid as k , valid as v, avg(val) as avgval from Table_B group by keyid,valid )temp
on k=t1.keyid and t1.valid=v;
You have to check whether table_A can be altered to change its schema; otherwise you can load the data into another table.
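The left join against a pre-aggregated sub-query can be tried out on the sample data from the question. This is a minimal sketch in Python with an in-memory SQLite database rather than Hive; the lower-case table and column names and the `agg` alias are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (keyid TEXT, valid TEXT, val INTEGER);
    CREATE TABLE table_b (keyid TEXT, valid TEXT, val INTEGER);
    INSERT INTO table_a VALUES ('W1', 'V1', 10), ('W2', 'V2', 20);
    INSERT INTO table_b VALUES
        ('W1', 'V1', 10), ('W1', 'V1', 30), ('W1', 'V3', 40), ('W1', 'V4', 50),
        ('W2', 'V2', 0),  ('W2', 'V2', 50), ('W2', 'V2', 70), ('W2', 'V4', 80);
""")

# Aggregate table_b once per (keyid, valid), then attach the average to
# every table_a row with a left join so unmatched rows are still kept.
rows = conn.execute("""
    SELECT t1.keyid, t1.valid, t1.val, agg.avgval
    FROM table_a AS t1
    LEFT JOIN (
        SELECT keyid, valid, AVG(val) AS avgval
        FROM table_b
        GROUP BY keyid, valid
    ) AS agg
        ON agg.keyid = t1.keyid AND agg.valid = t1.valid
    ORDER BY t1.keyid
""").fetchall()

for row in rows:
    print(row)
```

The result set has the shape of the "Updated Table A" in the question, so it can be written to a new table if the existing one cannot be altered.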
I have a table with lots of cost columns for each key:
TableA
SK1 SK2 Col1 Col2 Col3..... Col50 Flg(Y/N)
1 2 10 20 30 ...... 500 Y
1 2 10 20 30 ...... 500 N
2 2 10 20 30 ...... 500 N
I need to aggregate (sum) all the values per key, then check whether any row in the group has the flag Y, and if so add the group to a new tableB.
Here the TableA record combination (1,2) for (sk1,sk2) should be returned.
The query I have written selects the list of all columns and groups by the keys.
We have lots of data, so this query takes too long to run. Is there any way to rework it so that it runs faster?
select
Sk1,
Sk2,
nvl(sum(col3),0),
nvl(sum(col4),0),
.....
nvl(sum(col50),0)
from tableA
group by Sk1,
Sk2
I am using this as part of a larger query in which many other calculations are performed on top of this.
Working out whether any of a grouped set of records contains a 'Y' would be as simple as ...
select ...
from ...
group by ...
having max(flg) = 'Y'
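As a quick check of the `HAVING max(flg) = 'Y'` idea, here is a minimal sketch in Python with an in-memory SQLite database. Only two of the fifty cost columns are modelled, and `COALESCE` stands in for Oracle's `nvl`; both are simplifications for the example. It relies on 'Y' sorting after 'N', so the maximum of a group is 'Y' exactly when at least one row carries the flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (sk1 INT, sk2 INT, col1 INT, col2 INT, flg TEXT);
    INSERT INTO tableA VALUES
        (1, 2, 10, 20, 'Y'),
        (1, 2, 10, 20, 'N'),
        (2, 2, 10, 20, 'N');
""")

# Sum the cost columns per key and keep only groups containing a 'Y' row.
groups = conn.execute("""
    SELECT sk1, sk2,
           COALESCE(SUM(col1), 0) AS sum_col1,
           COALESCE(SUM(col2), 0) AS sum_col2
    FROM tableA
    GROUP BY sk1, sk2
    HAVING MAX(flg) = 'Y'
""").fetchall()

print(groups)
```

Only the (1, 2) group is returned, and the flag check happens in the same pass as the aggregation, so no second scan of the table is needed.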
For now I have created a temporary table and loaded all the data into it.
If you are using this as part of a large query, did you try the WITH clause?
It could look like this:
WITH SUM_DATA AS (select col1, col2, nvl(sum(col3),0), nvl(sum(col4),0), ..... nvl(sum(col50),0) from tableA group by col1, col2)
SELECT xyz
FROM abc, sum_data
WHERE abc.join_col = sum_data.join_col
I have two tables which I am trying to join based on two criteria. One of the criteria is that a date from t1 is between a date in t2 and the next date in t2. The other is that the name from t1 matches the name from t2.
I.e. if t2 looks like this:
Record Name Date
1 A1234 2016-01-03 04:58:00
2 A1234 2015-12-15 08:34:00
3 A5678 2016-01-04 03:14:00
4 A1234 2016-01-05 21:06:00
Then:
Any records from t1 for Name A1234 with dates between 2016-01-03 04:58:00 and 2016-01-05 21:06:00 would be joined to record 1.
Any records from t1 for Name A1234 with dates between 2015-12-15 08:34:00 and 2016-01-03 04:58:00 would be joined to record 2
Any records from t1 for A1234 after the date of record 4 would be joined to record 4
Any records from t1 for A5678 would be joined to record 3 because there's only one date.
My initial approach is to use a correlated subquery to find the next date. However, due to a large number of records, I determined this would take over a year to execute because it searches all of t2 for the next later date during each iteration. Original SQLite:
CREATE TABLE outputtable AS SELECT * FROM t1, t2 d
WHERE t1.Name = d.Name AND t1.Date BETWEEN d.Date AND (
SELECT * FROM (
SELECT Date from t2
WHERE t2.Name = d.Name
ORDER BY Date ASC )
WHERE Date > d.Date
LIMIT 1 )
Now, I would like to find the next date only once for all records in t2 and create a new column in t2 that contains the next date. This way, I only search for the next date about 400,000 times instead of 56 billion times, significantly improving my performance.
Thus the output of the query I'm looking for would make t2 look like this:
Record Name Date Next_Date
1 A1234 2016-01-03 04:58:00 2016-01-05 21:06:00
2 A1234 2015-12-15 08:34:00 2016-01-03 04:58:00
3 A5678 2016-01-04 03:14:00 2999-12-31 23:59:59
4 A1234 2016-01-05 21:06:00 2999-12-31 23:59:59
Then I would be able to simply query whether t1.Date is between t2.Date and t2.Next_Date.
How can I build a query that will add the next date to a new column in t2?
Rather than add the new column, you should just be able to use a query like the one below to join the tables:
SELECT
T1.*,
T2_1.*
FROM
T1
INNER JOIN T2 T2_1 ON
T2_1.Name = T1.Name AND
T2_1.some_date < T1.some_date
LEFT OUTER JOIN T2 T2_2 ON
T2_2.Name = T1.Name AND
T2_2.some_date > T2_1.some_date
LEFT OUTER JOIN T2 T2_3 ON
T2_3.Name = T1.Name AND
T2_3.some_date > T2_1.some_date AND
T2_3.some_date < T2_2.some_date
WHERE
T2_3.Name IS NULL
You can do the same with NOT EXISTS, but this method often has better performance.
You can speed up (sub)queries by using proper indexes.
To check which indexes are actually used, use EXPLAIN QUERY PLAN.
Your original query, without any indexes, would be executed by SQLite 3.10.0 like this:
0|0|0|SCAN TABLE t1
0|1|1|SEARCH TABLE t2 AS d USING AUTOMATIC COVERING INDEX (name=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SCAN TABLE t2
1|0|0|USE TEMP B-TREE FOR ORDER BY
(The "automatic" index is created temporarily just for this query; the optimizer has estimated that this would still be faster than not using any index.)
In this case, you get the best query plan by indexing all columns used for lookups:
create index i1nd on t1(name, date);
create index i2nd on t2(name, date);
0|0|1|SCAN TABLE t2 AS d
0|1|0|SEARCH TABLE t1 USING INDEX i1nd (name=? AND date>? AND date<?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE t2 USING COVERING INDEX i2nd (name=? AND date>?)
I've used this method successfully on tables with around 1 million rows. Obviously, creating an index that covers this query will help performance.
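The Next_Date column the question asks for, combined with the (Name, Date) index suggested above, could be sketched as follows in Python with an in-memory SQLite database. The t1 rows are hypothetical, the dates are assumed to be stored as ISO-8601 text so that string comparison matches chronological order, and the final join uses a half-open range (`>=` and `<`) instead of BETWEEN so that a t1 row falling exactly on a boundary matches only one t2 record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t2 (Record INTEGER PRIMARY KEY, Name TEXT, Date TEXT);
    INSERT INTO t2 VALUES
        (1, 'A1234', '2016-01-03 04:58:00'),
        (2, 'A1234', '2015-12-15 08:34:00'),
        (3, 'A5678', '2016-01-04 03:14:00'),
        (4, 'A1234', '2016-01-05 21:06:00');
    CREATE INDEX i2nd ON t2(Name, Date);

    -- Compute the next date once per t2 row; the index keeps each lookup cheap.
    ALTER TABLE t2 ADD COLUMN Next_Date TEXT;
    UPDATE t2
    SET Next_Date = COALESCE(
        (SELECT MIN(n.Date) FROM t2 AS n
         WHERE n.Name = t2.Name AND n.Date > t2.Date),
        '2999-12-31 23:59:59');

    -- Hypothetical t1 rows, one for each case described in the question.
    CREATE TABLE t1 (Name TEXT, Date TEXT);
    INSERT INTO t1 VALUES
        ('A1234', '2016-01-04 12:00:00'),
        ('A1234', '2015-12-20 00:00:00'),
        ('A1234', '2016-02-01 00:00:00'),
        ('A5678', '2016-01-10 00:00:00');
""")

next_dates = conn.execute(
    "SELECT Record, Next_Date FROM t2 ORDER BY Record"
).fetchall()

# Simple range join against the precomputed column.
matched = conn.execute("""
    SELECT t1.Name, t1.Date, t2.Record
    FROM t1
    JOIN t2 ON t2.Name = t1.Name
           AND t1.Date >= t2.Date
           AND t1.Date < t2.Next_Date
    ORDER BY t1.Date
""").fetchall()

print(next_dates)
print(matched)
```

The correlated sub-query still runs once per t2 row, but that is the roughly 400,000 lookups mentioned in the question rather than one lookup per joined pair.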
This approach uses RANK to create a value to join against. After creating the rank in a CTE (I use one for readability; adjust for style or personal preference), join the CTE to itself on rnk = rnk + 1, i.e. the next date.
Here's an example of what the code looks like using your sample values.
IF OBJECT_ID('tempdb..#T2') IS NOT NULL
DROP TABLE #T2
CREATE TABLE #T2
(
Record INT NOT NULL PRIMARY KEY,
Name VARCHAR(10),
[DATE] DATETIME,
)
INSERT INTO #T2
VALUES (1, 'A1234', '2016-01-03 04:58:00'),
(2, 'A1234', '2015-12-15 08:34:00'),
(3, 'A5678', '2016-01-04 03:14:00'),
(4, 'A1234', '2016-01-05 21:06:00');
WITH Rank_Dates
AS (Select *
,rank() OVER(PARTITION BY #t2.name ORDER BY #t2.date DESC) AS rnk
FROM #T2)
select RD1.Record,
RD1.Name,
RD1.DATE,
COALESCE (RD2.DATE, '2999-12-31 23:59:59') AS NEXT_DATE
FROM Rank_Dates RD1
LEFT JOIN Rank_Dates RD2
ON RD1.rnk = RD2.rnk + 1
AND RD1.Name = RD2.Name
ORDER BY RD1.Record -- ORDER BY is optional
;
EDIT: added code output below.
The code above produces the following output.
Record Name DATE NEXT_DATE
1 A1234 2016-01-03 04:58:00.000 2016-01-05 21:06:00.000
2 A1234 2015-12-15 08:34:00.000 2016-01-03 04:58:00.000
3 A5678 2016-01-04 03:14:00.000 2999-12-31 23:59:59.000
4 A1234 2016-01-05 21:06:00.000 2999-12-31 23:59:59.000
On a side note: would using CURRENT_TIMESTAMP in place of the hard-coded '2999-12-31 23:59:59.000' produce a similar result?