clickhouse alter MATERIALIZED VIEW add column

env: ClickHouse version 22.3.3.44; database engine: Atomic
I have a raw table and an MV, with schemas like this:
CREATE TABLE IF NOT EXISTS test.Income_Raw on cluster '{cluster}' (
Id Int64,
DateNum Date,
Cnt Int64,
LoadTime DateTime
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(LoadTime)
ORDER BY (Id, DateNum);
CREATE MATERIALIZED VIEW test.Income_MV on cluster '{cluster}'
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/{shard}/{database}/{table}', '{replica}')
PARTITION BY toYYYYMM(DateNum)
ORDER BY (Id, DateNum)
TTL DateNum + INTERVAL 100 DAY
AS SELECT
DateNum,
Id,
argMaxState(Cnt, LoadTime) AS Cnt,
maxState(LoadTime) AS latest_loadtime
FROM test.Income_Raw
GROUP BY Id, DateNum;
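(Side note: because the MV stores aggregate-function states, reading it back needs the matching -Merge combinators; a typical query against it looks like this:)
SELECT
Id,
DateNum,
argMaxMerge(Cnt) AS Cnt,
maxMerge(latest_loadtime) AS latest_loadtime
FROM test.Income_MV
GROUP BY Id, DateNum;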
Now I want to add a column named 'Price' to the raw table and the MV,
so I ran SQL step by step like below:
// first I alter the raw table
1. alter table test.Income_Raw on cluster '{cluster}' add column Price Int32
// then I run the SQL below to alter the MV
2. detach table test.Income_MV on cluster '{cluster}'
3. alter table test.`.inner_id.{uuid}` on cluster '{cluster}' add column Price Int32
// step 4: basically I just replaced 'create' with 'attach' and added 'Price' to the select query
4. attach MATERIALIZED VIEW test.Income_MV on cluster '{cluster}'
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/{shard}/{database}/{table}', '{replica}')
PARTITION BY toYYYYMM(DateNum)
ORDER BY (Id, DateNum)
TTL DateNum + INTERVAL 100 DAY
AS SELECT
DateNum,
Id,
Price,
argMaxState(Cnt, LoadTime) AS Cnt,
maxState(LoadTime) AS latest_loadtime
FROM test.Income_Raw
GROUP BY Id, DateNum, Price;
but at step 4 I got an error like this:
Code: 80. DB::Exception: Incorrect ATTACH TABLE query for Atomic database engine. Use one of the following queries instead:
1. ATTACH TABLE Income_MV;
2. CREATE TABLE Income_MV <table definition>;
3. ATTACH TABLE Income_MV FROM '/path/to/data/' <table definition>;
4. ATTACH TABLE Income_MV UUID '<uuid>' <table definition>;. (INCORRECT_QUERY) (version 22.3.3.44 (official build))
The SQL I ran followed the references below:
https://kb.altinity.com/altinity-kb-schema-design/materialized-views/
Clickhouse altering materialized view's select
So my question is: how do I modify the MV's select query, and which step did I get wrong?

I figured out that you just need to:
prepare: use an explicit target table (the TO clause) instead of the MV's inner table
1. alter the MV's target table
2. drop the MV
3. re-create the MV with the new select query (sketch below)
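Here is a sketch of that flow against the schema above. The explicit target table name test.Income_Agg is made up for illustration, and the AggregateFunction column types must match the -State functions in the select:
-- explicit target table holding the aggregate states
CREATE TABLE test.Income_Agg ON CLUSTER '{cluster}' (
Id Int64,
DateNum Date,
Cnt AggregateFunction(argMax, Int64, DateTime),
latest_loadtime AggregateFunction(max, DateTime)
) ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/{shard}/{database}/{table}', '{replica}')
PARTITION BY toYYYYMM(DateNum)
ORDER BY (Id, DateNum)
TTL DateNum + INTERVAL 100 DAY;
-- the MV writes into the target table via TO; partitioning and TTL live on the target
CREATE MATERIALIZED VIEW test.Income_MV ON CLUSTER '{cluster}'
TO test.Income_Agg
AS SELECT
DateNum,
Id,
argMaxState(Cnt, LoadTime) AS Cnt,
maxState(LoadTime) AS latest_loadtime
FROM test.Income_Raw
GROUP BY Id, DateNum;
-- 1. alter the target table like any ordinary table; MODIFY ORDER BY can
--    append a column added in the same ALTER to the sorting key
ALTER TABLE test.Income_Agg ON CLUSTER '{cluster}'
ADD COLUMN Price Int32,
MODIFY ORDER BY (Id, DateNum, Price);
-- 2. drop the MV; the data lives in test.Income_Agg and survives
DROP TABLE test.Income_MV ON CLUSTER '{cluster}';
-- 3. re-create the MV with the new select query
CREATE MATERIALIZED VIEW test.Income_MV ON CLUSTER '{cluster}'
TO test.Income_Agg
AS SELECT
DateNum,
Id,
Price,
argMaxState(Cnt, LoadTime) AS Cnt,
maxState(LoadTime) AS latest_loadtime
FROM test.Income_Raw
GROUP BY Id, DateNum, Price;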

Related

How to change PARTITION in clickhouse

version 18.16.1
CREATE TABLE traffic (
`date` Date,
...
) ENGINE = MergeTree(date, (end_time), 8192);
I want to change it to PARTITION BY toYYYYMMDD(date) without dropping the table. How can I do that?
Since ALTER does not allow changing the partition key, the way to do it is to create a new table (note that a custom PARTITION BY requires the new-style engine syntax):
CREATE TABLE traffic_new
(
`date` Date,
...
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(date)
ORDER BY (end_time);
and to move your data
INSERT INTO traffic_new SELECT * FROM traffic WHERE column BETWEEN x and xxxx;
Rename the final table if necessary, as shown below.
And yes, this option involves deleting the old table (there seems to be no way to skip this step).
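The swap itself can be sketched like this (table names from the example above):
RENAME TABLE traffic TO traffic_old, traffic_new TO traffic;
-- once the copied data is verified:
DROP TABLE traffic_old;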

Loading Data into an empty Impala Table with account data partitioned by area code

I'm trying to copy data from a table called accounts into an empty table called accounts_by_areacode. I have the following fields in accounts_by_areacode: acct_num INT, first_name STRING, last_name STRING, phone_number STRING. The table is partitioned by areacode (the first 3 digits of phone_number).
I need to use a SELECT statement to extract the area code into an INSERT INTO TABLE command to copy the specified columns to the new table, dynamically partitioning by area code.
This is my last attempt:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number, areacode) PARTITION (areacode) SELECT STRLEFT (phone_number,3) AS areacode FROM accounts;"
This generates ERROR: AnalysisException: Column permutation and PARTITION clause mention more columns (5) than the SELECT / VALUES clause and PARTITION clause return (1). I'm not convinced I have even the basic syntax correct so any help would be great as I'm new to Impala.
Impala creates partitions dynamically based on the data, so I'm not sure why you want to create an empty table with partitions: they will be auto-created while inserting new data.
Still, I think you can create the empty partitions like this:
impala-shell -q "INSERT INTO TABLE accounts_by_areacode (acct_num) PARTITION (areacode)
SELECT CAST(NULL AS INT), STRLEFT (phone_number,3) AS areacode FROM accounts;"
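That said, for the original goal of copying all the columns, the select has to return a value for every column mentioned, with the partition expression last. A sketch, assuming the column names above:
INSERT INTO TABLE accounts_by_areacode (acct_num, first_name, last_name, phone_number) PARTITION (areacode)
SELECT acct_num, first_name, last_name, phone_number, STRLEFT(phone_number, 3) AS areacode
FROM accounts;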

Hive insert overwrites truncates the table in few cases

I was working on a solution and found that in some particular cases hive insert overwrite truncates the table, while in a few cases it doesn't. Would someone please explain why it behaves like that?
To explain this, I am taking two tables, source and target, and trying to insert data into target from the source table using insert overwrite.
When Source Table has partition
If the source table has a partition, and you write a condition such that the partition does not exist, then it won't truncate the target table.
create table source (name String) partitioned by (age int);
insert into source partition (age) values("gaurang", 11);
create table target (name String, age int);
insert into target values("xxx", 99);
The following query won't truncate the table even though the select doesn't return anything:
insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=99;
However, the following query will truncate the table:
insert overwrite table temp.target select * from temp.source where name="Ddddd" and age=11;
It makes sense in the first case: as the partition (age=99) does not exist, it should stop the execution of the query there. However, this is my assumption; I'm not sure what exactly happens.
When Source Table Doesn't have partition, but Target has
In this case the target table won't be truncated even if the select statement from the source table returns 0 rows.
use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String) partitioned by (age int);
insert into source1 values ("gaurang", 11);
insert into target1 partition(age) values("xxx", 99);
select * from source1;
select * from target1;
The following query won't truncate the table even though the select statement finds no data:
insert overwrite table temp.target1 partition(age) select * from temp.source1 where age=90;
When Neither Source nor Target Has a Partition
In this case, if I insert overwrite the target and the select statement doesn't return any rows, the target table will be truncated.
Check the example below.
use temp;
drop table if exists source1;
drop table if exists target1;
create table source1 (name String, age int);
create table target1 (name String, age int);
insert into source1 values ("gaurang", 11);
insert into target1 values("xxx", 99);
select * from source1;
select * from target1;
The following query will truncate the target table:
insert overwrite table temp.target1 select * from temp.source1 where age=90;
Better to use the term 'overwrite' instead of 'truncate', because that is exactly what happens during insert overwrite.
When you write overwrite table temp.target1 partition(age), you instruct Hive to overwrite partitions, not the whole target1 table: only those partitions that are returned by the select.
An empty dataset will not overwrite partitions in dynamic partition mode, because the partitions to overwrite are unknown: the partition values should be taken from the dataset, and since the dataset is empty, there is nothing to overwrite.
And in the case of a non-partitioned table, it is already known that the whole table should be overwritten, empty dataset or not.
The partition column in an insert overwrite statement should be the last column selected. The list of partitions to be overwritten in the target equals the list of values in the partition column returned by the dataset; it does not matter how the source table is partitioned (you can select the target partition column from any source column, calculate it, or use a constant). Only what is returned matters; see the sketch below.
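For example, a sketch where the partition value is computed as a constant (hypothetical value; fully dynamic partition inserts may need nonstrict mode):
-- set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table temp.target1 partition(age)
select name, 25 as age from temp.source1;
-- only partition age=25 is overwritten; all other partitions stay untouched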

Create a generic DB table

I have multiple products, and each of them has its own Product table and Value table. Now I have to create a generic screen to validate those products, and I don't want to create a validation table for each Product. I want to create a generic table which will have all the Product details plus one extra column called ProductIdentifier. The problem is that this generic table may end up holding millions of records, and fetching the data will take time.
Is there any other, better solution?
"Millions of records" sounds like a VLDB problem. I'd put the data into a partitioned table:
CREATE TABLE myproducts (
productIdentifier NUMBER,
value1 VARCHAR2(30),
value2 DATE
) PARTITION BY LIST (productIdentifier)
( PARTITION p1 VALUES (1),
PARTITION p2 VALUES (2),
PARTITION p5to9 VALUES (5,6,7,8,9)
);
For queries that are dealing with only one product, specify the partition:
SELECT * FROM myproducts PARTITION FOR (9);
For your general report, just omit the partition and you get all numbers:
SELECT * FROM myproducts;
Documentation is here:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/toc.htm
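Adding a new product later is then just a partition maintenance operation, e.g. (hypothetical partition name):
ALTER TABLE myproducts ADD PARTITION p3 VALUES (3);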

Use a Materialized View to track latest versions of records

We've got a highly (perhaps over?) normalized table that keeps track of versioned values. It's insert only, no updates.
Example Data:
"ID" "Version" "Value"
1 0 "A_1"
2 0 "B_1"
1 1 "A_2"
3 0 "C_1"
We frequently run queries to pull only the latest values for each ID. As we're hitting millions of rows, we're starting to encounter performance problems. I've been able to prototype improvements using Materialized Views, but have not been able to create them in such a way that they self-refresh "ON COMMIT".
What I've got so far is this (revised below):
CREATE MATERIALIZED VIEW TABLE_LATEST
BUILD IMMEDIATE
REFRESH FAST
ON COMMIT AS
SELECT T.ID
,T.LAST_VERSION
FROM (
SELECT ID
,MAX(VERSION) OVER (PARTITION BY ID) LAST_VERSION
FROM TABLE
) T
GROUP BY T.ID, T.LAST_VERSION;
Which is now revised, due to feedback:
CREATE MATERIALIZED VIEW TABLE_LATEST
BUILD IMMEDIATE
REFRESH FAST
ON COMMIT AS
SELECT ID
,MAX(VERSION) AS LAST_VERSION
FROM TABLE
GROUP BY ID;
Which fails with:
ORA-12033: cannot use filter columns from materialized view log on
"SCHEMA"."TABLE"
*Cause: The materialized view log either did not have filter columns
logged, or the timestamp associated with the filter columns was
more recent than the last refresh time.
*Action: A complete refresh is required before the next fast refresh.
Add filter columns to the materialized view log, if required.
It will only 'work' if I change Refresh to Force and remove On Commit. I can't tell if this falls under the 'No Analytics' rule for Materialized Views or if perhaps I've created the Log incorrectly in the first place?
CREATE MATERIALIZED VIEW LOG ON TABLE
LOGGING
WITH SEQUENCE, ROWID, (VALUE)
INCLUDING NEW VALUES;
Table Schema:
CREATE TABLE "TABLE"
(
ID NUMBER(10, 0) NOT NULL
, VERSION NUMBER(10, 0) NOT NULL
, VALUE VARCHAR2(4000 CHAR)
, CONSTRAINT MASTERRECORDFIELDVALUES_PK PRIMARY KEY
(
ID
, VERSION
)
USING INDEX
(
CREATE UNIQUE INDEX TABLE_PK ON TABLE(ID ASC, VERSION ASC)
LOGGING
...
)
ENABLE
)
LOGGING
Am I even on the right track? Would there be a better performing way to pre-calculate latest versions? Or do I just need to get the Log & View settings dialed in?
If you don't need the value associated with the latest version, then you can simply do:
CREATE MATERIALIZED VIEW LOG ON t1
LOGGING
WITH SEQUENCE, ROWID, (val)
INCLUDING NEW VALUES;
create materialized view t1_latest
refresh fast on commit
as
select id,
max(version) latest_version
from t1
group by id;
The test case for this can be found over at Oracle LiveSQL.
Otherwise, you need to create three separate MVs (because you can't have a fast refreshable on commit materialized view that involves keep dense_rank first/last) - as per http://www.sqlsnippets.com/en/topic-12926.html - like so:
Materialized view log on the main table:
CREATE MATERIALIZED VIEW LOG ON t1
LOGGING
WITH SEQUENCE, ROWID, (val)
INCLUDING NEW VALUES;
First materialized view:
create materialized view t1_sub_mv1
refresh fast on commit
as
select id,
max(version) latest_version,
count(version) cnt_version,
count(*) cnt_all
from t1
group by id;
Materialized view log on the first materialized view:
create materialized view log on t1_sub_mv1
with rowid, sequence (id, latest_version, cnt_version, cnt_all)
including new values;
Second materialized view:
create materialized view t1_sub_mv2
refresh fast on commit
as
select id,
version,
max(val) max_val_per_id_version,
count(*) cnt_all
from t1
group by id,
version;
Materialized view log on the second materialized view:
create materialized view log on t1_sub_mv2
with rowid, sequence (id, max_val_per_id_version, cnt_all)
including new values;
Third and final materialized view:
create materialized view t1_main_mv
refresh fast on commit
as
select mv1.id,
mv1.latest_version,
mv2.max_val_per_id_version val_of_latest_version,
mv1.rowid mv1_rowid,
mv2.rowid mv2_rowid
from t1_sub_mv1 mv1,
t1_sub_mv2 mv2
where mv1.id = mv2.id
and mv1.latest_version = mv2.version;
The supporting test case for this can be found over at Oracle LiveSQL.
I am sorry, but I can't give you an answer straight away. One reason might be the use of analytic functions, which are not that well supported by MVs. To analyze the problem you will need to take a look at the capabilities of the materialized view:
DECLARE
-- Local variables here
--
v_sql VARCHAR2(32000) := 'SELECT T.ID
,T.LAST_VERSION
FROM (SELECT ID
,MAX(VERSION) OVER (PARTITION BY ID) LAST_VERSION
FROM TABLE) T
GROUP BY T.ID
,T.LAST_VERSION';
v_msg_array SYS.EXPLAINMVARRAYTYPE;
msg SYS.ExplainMVMessage;
BEGIN
-- Test statements here
dbms_mview.explain_mview(mv => v_sql, msg_array => v_msg_array);
FOR i IN v_msg_array.FIRST..v_msg_array.LAST LOOP
msg := v_msg_array(i);
DBMS_OUTPUT.put_line('MVOWNER:' || msg.MVOWNER);
DBMS_OUTPUT.put_line('MVNAME:' || msg.MVNAME);
DBMS_OUTPUT.put_line('CAPABILITY_NAME:' || msg.CAPABILITY_NAME);
DBMS_OUTPUT.put_line('POSSIBLE:' || msg.POSSIBLE);
DBMS_OUTPUT.put_line('RELATED_TEXT:' || msg.RELATED_TEXT);
DBMS_OUTPUT.put_line('RELATED_NUM:' || msg.RELATED_NUM);
DBMS_OUTPUT.put_line('MSGNO:' || msg.MSGNO);
DBMS_OUTPUT.put_line('MSGTXT:' || msg.MSGTXT);
DBMS_OUTPUT.put_line('SEQ:' || msg.SEQ);
DBMS_OUTPUT.put_line('----------------------------------------');
END LOOP;
END;
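The report is printed via DBMS_OUTPUT, so enable server output before running the block, e.g. in SQL*Plus or SQLcl:
SET SERVEROUTPUT ON
-- then execute the anonymous block above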
BTW: you can write your query far more simply:
SELECT t.id,
MAX(t.version) AS last_version
FROM table t
GROUP BY t.id;
