How to change the order of a composite primary key in ClickHouse efficiently

I have a table with schema
CREATE TABLE traffic (
    date Date,
    val1 UInt64,
    val2 UInt64
    ...
) ENGINE = ReplicatedMergeTree(date, (val1, val2), 8192);
The partition key here is date. I want to change the key order from (val1, val2) to (val2, val1).
The only way I know is to rename this table to something like traffic_temp, create a new table named traffic with ordering (val2, val1), copy the data from traffic_temp into traffic, and then drop the temp table.
But the dataset is huge; is there any better way to do it?

No other way, only INSERT ... SELECT.
You can use clickhouse-copier, but it does the same INSERT ... SELECT.
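For reference, a minimal sketch of that rename-and-copy flow using the current DDL syntax; the ReplicatedMergeTree ZooKeeper path and replica macro below are placeholders, and the column list follows the question:
RENAME TABLE traffic TO traffic_temp;

CREATE TABLE traffic (
    date Date,
    val1 UInt64,
    val2 UInt64
    -- ... remaining columns as in the original table
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/traffic', '{replica}')  -- placeholder path/replica
PARTITION BY toYYYYMM(date)   -- equivalent of the old date-based monthly partitioning
ORDER BY (val2, val1);        -- the new key order

INSERT INTO traffic SELECT * FROM traffic_temp;

-- only after verifying the copy
DROP TABLE traffic_temp;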

Related

How to decide the partition key for ClickHouse

I want to know what the best practice is for choosing the partition key.
In my project, we have a table with event_date, app_id and other columns. The number of distinct app_id values will keep growing and could reach thousands.
The select query filters on event_date and app_id.
The simplified schema is as below:
CREATE TABLE test.test_custom_partition (
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
) ENGINE = MergeTree()
PARTITION BY (toYYYYMMDD(event_date), app_id)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;
The select query is like below:
select event_name from test_custom_partition
where event_date >= '2020-07-01 00:00:00' AND event_date <= '2020-07-15 00:00:00'
AND app_id = 'test';
I want to use (toYYYYMMDD(event_date), app_id) as the partition key, so that the query reads the minimal set of data parts. But it could result in more than 1000 partitions; in the documentation I see:
A merge only works for data parts that have the same value for the
partitioning expression. This means you shouldn't make overly granular
partitions (more than about a thousand partitions). Otherwise, the
SELECT query performs poorly because of an unreasonably large number
of files in the file system and open file descriptors.
Or should I use only toYYYYMMDD(event_date) as the partition key (sketched below)?
Also, could anyone explain why there shouldn't be more than about 1000 partitions? Even if the query only touches a small set of the data parts, could it still cause a performance issue?
Thanks
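For clarity, the alternative being weighed, partitioning by date only while keeping app_id at the front of the sorting key, would look roughly like this (same columns as above):
CREATE TABLE test.test_custom_partition (
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_date)        -- date only: bounded number of partitions
ORDER BY (app_id, company_id, event_date)  -- app_id still leads the primary index
SETTINGS index_granularity = 8192;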

How to change PARTITION in ClickHouse

version 18.16.1
CREATE TABLE traffic (
    `date` Date,
    ...
) ENGINE = MergeTree(date, (end_time), 8192);
I want to change it to PARTITION BY toYYYYMMDD(date) without dropping the table. How can I do that?
Since the ALTER query does not allow changing the partitioning, the only possible way is to create a new table
CREATE TABLE traffic_new
(
    `date` Date,
    ...
)
ENGINE = MergeTree()                 -- old-style engine arguments cannot be combined with PARTITION BY
PARTITION BY toYYYYMMDD(date)
ORDER BY end_time;
and to move your data
INSERT INTO traffic_new SELECT * FROM traffic WHERE column BETWEEN x and xxxx;
Rename the final table if necessary (see the sketch below).
And yes, this option involves deleting the old table (there seems to be no way to skip this step).
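A sketch of that final swap; traffic_old is just an illustrative intermediate name:
RENAME TABLE traffic TO traffic_old, traffic_new TO traffic;

-- once the new table has been verified
DROP TABLE traffic_old;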

ClickHouse: Automatically insert data into SummingMergeTree from MergeTree

I am using ClickHouse to store raw data in a MergeTree table. I actually need the data in a SummingMergeTree, where columns are summed up based on the primary key.
I need to know whether ClickHouse provides a way for data to be inserted automatically into the SummingMergeTree table as soon as it enters the MergeTree table.
You can use a MATERIALIZED VIEW to achieve that. Suppose you have a raw_data table with the following definition:
CREATE TABLE raw_data (key int, i int, j int) engine MergeTree ORDER BY key;
Then you can define the SummingMergeTree table like this:
CREATE MATERIALIZED VIEW summing_data (key int, i int, j int)
ENGINE = SummingMergeTree((i, j)) ORDER BY key
AS SELECT * FROM raw_data;
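A quick way to sanity-check the setup above: insert into raw_data and read from summing_data. Since SummingMergeTree collapses rows only during background merges, it is safest to aggregate explicitly when querying:
INSERT INTO raw_data VALUES (1, 10, 100), (1, 20, 200), (2, 5, 50);

-- rows sharing a key may not be merged yet, so sum at query time
SELECT key, sum(i) AS i, sum(j) AS j
FROM summing_data
GROUP BY key;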

Create a generic DB table

I have multiple products, and each of them has its own Product table and Value table. Now I have to create a generic screen to validate those products, and I don't want to create a validation table for each product. I want to create one generic table which will hold the details of all the products plus one extra column called ProductIdentifier. The problem is that this generic table may end up containing millions of records, and fetching the data will take time.
Is there any better solution?
"Millions of records" sounds like a VLDB problem. I'd put the data into a partitioned table:
CREATE TABLE myproducts (
    productIdentifier NUMBER,
    value1 VARCHAR2(30),
    value2 DATE
)
PARTITION BY LIST (productIdentifier) (
    PARTITION p1 VALUES (1),
    PARTITION p2 VALUES (2),
    PARTITION p5to9 VALUES (5, 6, 7, 8, 9)
);
For queries that are dealing with only one product, specify the partition:
SELECT * FROM myproducts PARTITION FOR (9);
For your general report, just omit the partition and you get all numbers:
SELECT * FROM myproducts;
Documentation is here:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/toc.htm
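If new products appear later, a matching list partition can be added without touching the existing data (a sketch; the partition name and value are illustrative):
ALTER TABLE myproducts ADD PARTITION p10 VALUES (10);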

How to use an INSERT statement for a Hive partitioned table?

I have a Hive table dynpart:
id int
name char(30)
city char(30)
thisday string
# Partition Information
# col_name data_type comment
thisday string
It is partitioned by thisday, whose data type is STRING.
How can I insert a single record into the table in a particular partition? I know there is a LOAD command to load an entire file's data into a Hive table. I just want to know how an INSERT statement can be written for a partitioned table. I tried writing a command like the one below, but it takes data from another table.
insert into droplater partition(thisday='30/03/2017') select * from dynpart;
The table droplater has the same structure as dynpart, but the above command inserts data from another table. What I'd like to learn is how to write a simple INSERT command for a partition, like insert into tabname values(1,"abcd","efgh");
This will work for primitive types only (no arrays, structs etc.)
insert into tabname partition (thisday='30/03/2017') values (1,"abcd","efgh");
This will work for all types
insert into tabname partition (thisday='30/03/2017') select 1,"abcd","efgh";
P.S.
By all means, partition your table by a DATE column (thisday date)
insert into tabname partition (thisday=date '2017-03-30') ...
or at least use the ISO date format
insert into tabname partition (thisday='2017-03-30') ...
