I am using ClickHouse to store raw data in a MergeTree table. I actually need the data in a SummingMergeTree table, where columns are summed up based on the primary key.
I need to know whether ClickHouse provides a way to insert data automatically into the SummingMergeTree table as soon as data enters the MergeTree table.
You can use a MATERIALIZED VIEW to achieve that. Suppose you have a raw_data table with the following definition:
CREATE TABLE raw_data (key int, i int, j int) ENGINE = MergeTree ORDER BY key;
Then you can define the SummingMergeTree table like this:
CREATE MATERIALIZED VIEW summing_data (key int, i int, j int) ENGINE = SummingMergeTree((i, j)) ORDER BY key AS SELECT * FROM raw_data;
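As a quick sanity check (a sketch with made-up values; note that SummingMergeTree collapses rows only during background merges, so aggregate explicitly when reading):
-- Rows inserted into raw_data are forwarded to summing_data by the materialized view.
INSERT INTO raw_data VALUES (1, 10, 20), (1, 5, 5), (2, 1, 1);
-- Summing happens at merge time, so aggregate (or use FINAL) when querying.
SELECT key, sum(i) AS i, sum(j) AS j
FROM summing_data
GROUP BY key;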
I'm trying to sample rows from a ClickHouse table.
Below you can find the table definition:
CREATE TABLE trades
(
`id` UInt32,
`ticker_id` UUID,
`epoch` DateTime,
`nanoseconds` UInt32,
`amount` Float64,
`cost` Float64,
`price` Float64,
`side` UInt8
)
ENGINE = MergeTree
PARTITION BY (ticker_id, toStartOfInterval(epoch, toIntervalHour(1)))
ORDER BY (ticker_id, epoch)
SETTINGS index_granularity = 8192
I want to sample 10000 rows from the table:
SELECT * FROM trades SAMPLE 10000;
But when I'm trying to run the query above I'm getting the following error:
DB::Exception: Illegal SAMPLE: table doesn't support sampling
I want to ALTER the table in order to be able to sample from it, but at the same time I want to make sure that I won't corrupt the data while altering the table.
The table has about 1 billion rows.
What would be a good way to ALTER the table while making sure the data won't get corrupted?
You can try:
ALTER TABLE trades MODIFY SAMPLE BY toUnixTimestamp(toStartOfHour(epoch));
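Note that ClickHouse requires the SAMPLE BY expression to be contained in the table's primary key, so the in-place ALTER may be rejected with ORDER BY (ticker_id, epoch). In that case the usual fallback is to create a new table that carries a sampling key and copy the data over with INSERT ... SELECT. A rough sketch (trades_sampled and the use of intHash32(id) are my own assumptions, not something from the original table):
CREATE TABLE trades_sampled
(
    `id` UInt32,
    `ticker_id` UUID,
    `epoch` DateTime,
    `nanoseconds` UInt32,
    `amount` Float64,
    `cost` Float64,
    `price` Float64,
    `side` UInt8
)
ENGINE = MergeTree
PARTITION BY (ticker_id, toStartOfInterval(epoch, toIntervalHour(1)))
ORDER BY (ticker_id, epoch, intHash32(id))  -- the sampling key must appear in the sorting key
SAMPLE BY intHash32(id);

INSERT INTO trades_sampled SELECT * FROM trades;
SELECT * FROM trades_sampled SAMPLE 10000;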
I have a ClickHouse server with an ENGINE = Kafka table that has Nested fields and the settings kafka_handle_error_mode = 'stream' and input_format_import_nested_json = 1, plus two materialized views:
one for the _error = '' case, which stores data into an underlying table with the same structure as the Kafka table
one for _error != '', which stores the raw message and the error in case of 'wrong' data
The problem is that when ClickHouse gets a message from Kafka with different Nested column lengths (e.g. {"n":{"a":["1","2"], "b":["3"]}}), it passes through the Kafka table without generating an _error and gets stuck on the insert into the target table (hanging the entire save loop), because the Kafka table doesn't check the Nested column lengths but the target table does.
There is a flatten_nested = 0 setting which seems to change the Nested behavior, but it demands a different JSON structure, which is unacceptable in my case. Is there a workaround for that?
The Kafka engine does not check the sizes of Nested arrays; that check is enforced only by the MergeTree table on insert.
The structure of the Kafka engine table does not have to match the MergeTree table. Just add the corresponding transformation / check in the materialized view's SELECT.
Example:
create table n ( a Nested( n1 int, n2 int ) ) Engine=Kafka ....;
create table m ( an1 Array(int), an2 Array(int) ) Engine = MergeTree order by tuple();
create materialized view m_mv to m
as select
a.n1 as an1,
a.n2 as an2
from n;
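If the mismatched lengths are the concern, that check or normalization can go into the same SELECT. One possible variation of the view above (an untested sketch; m_mv_safe is a hypothetical name) pads the shorter array so the MergeTree table always receives arrays of equal length:
create materialized view m_mv_safe to m
as select
    arrayResize(a.n1, greatest(length(a.n1), length(a.n2))) as an1,
    arrayResize(a.n2, greatest(length(a.n1), length(a.n2))) as an2
from n;
Alternatively, a WHERE length(a.n1) = length(a.n2) clause would simply skip the inconsistent rows instead of padding them.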
I am also interested to know whether there is a clean/performant way to solve it.
A suggestion that might fit your use case would be to store the raw message as a plain String in the Kafka table.
In the materialized view, you can then use JSON functions to extract the fields and filter out errors:
CREATE TABLE kafka(
payload String
) engine = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'nested',
kafka_group_name = 'nested',
kafka_format = 'JSONAsString',
kafka_num_consumers = 1,
kafka_handle_error_mode = 'stream';
create materialized view consumer
to valid_payload_table
as select
    JSONExtract(payload, 'a', 'String') as a,
    JSONExtract(payload, 'b', 'String') as b
from kafka
where _error = '';
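With the sample payload from the question ({"n":{"a":["1","2"], "b":["3"]}}), the nested arrays can be extracted explicitly and a length check applied in the same view. A sketch along the same lines (the view name and the Array(String) types are assumptions):
create materialized view consumer_nested
to valid_payload_table
as select
    JSONExtract(payload, 'n', 'a', 'Array(String)') as a,
    JSONExtract(payload, 'n', 'b', 'Array(String)') as b
from kafka
where _error = ''
  and length(JSONExtract(payload, 'n', 'a', 'Array(String)'))
    = length(JSONExtract(payload, 'n', 'b', 'Array(String)'));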
I have a table with the following schema:
CREATE TABLE traffic (
date Date,
val1 UInt64,
val2 UInt64
...
) ENGINE = ReplicatedMergeTree(date, (val1, val2), 8192);
The partition key here is date. I want to change the order from (val1, val2) to (val2, val1).
The only way I know is to rename this table to something like traffic_temp, create a new table named traffic with ordering (val2, val1), copy the data from traffic_temp into traffic, and then drop the temp table.
But the dataset is huge; is there any better way to do it?
There is no other way; only INSERT ... SELECT.
You can use clickhouse-copier, but it does the same INSERT ... SELECT under the hood.
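For reference, a minimal sketch of that INSERT ... SELECT flow (traffic_new and traffic_old are placeholder names, and the ReplicatedMergeTree arguments must be adjusted to your ZooKeeper setup):
-- 1. Create a new table with the same columns but the new sorting key.
CREATE TABLE traffic_new ( ... ) ENGINE = ReplicatedMergeTree(date, (val2, val1), 8192);
-- 2. Copy the data.
INSERT INTO traffic_new SELECT * FROM traffic;
-- 3. Swap the tables and drop the old one once the copy is verified.
RENAME TABLE traffic TO traffic_old, traffic_new TO traffic;
DROP TABLE traffic_old;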
I have multiple products, and each of them has its own Product table and Value table. Now I have to create a generic screen to validate those products, and I don't want to create a validation table for each product. I want to create one generic table which holds the details of all products plus one extra column called ProductIdentifier. The problem is that this generic table may end up holding millions of records, and fetching the data will take time.
Is there any other better solution?
"Millions of records" sounds like a VLDB problem. I'd put the data into a partitioned table:
CREATE TABLE myproducts (
productIdentifier NUMBER,
value1 VARCHAR2(30),
value2 DATE
) PARTITION BY LIST (productIdentifier)
( PARTITION p1 VALUES (1),
PARTITION p2 VALUES (2),
PARTITION p5to9 VALUES (5,6,7,8,9)
);
For queries that are dealing with only one product, specify the partition:
SELECT * FROM myproducts PARTITION FOR (9);
For your general report, just omit the partition and you get all numbers:
SELECT * FROM myproducts;
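If new products show up later, the scheme extends with one more list partition per product (a hedged one-liner; the partition name and value are hypothetical):
ALTER TABLE myproducts ADD PARTITION p10 VALUES (10);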
Documentation is here:
https://docs.oracle.com/en/database/oracle/oracle-database/12.2/vldbg/toc.htm
I was trying to create partitions and buckets using Hive.
First I set some of the properties:
set hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Below is the code for creating the table:
CREATE TABLE transactions_production
( id string,
dept string,
category string,
company string,
brand string,
date1 string,
productsize int,
productmeasure string,
purchasequantity int,
purchaseamount double)
PARTITIONED BY (chain string) CLUSTERED BY (id) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Below is the code for inserting data into the table:
INSERT OVERWRITE TABLE transactions_production PARTITION (chain)
select id, dept, category, company, brand, date1, productsize, productmeasure,
purchasequantity, purchaseamount, chain from transactions_staging;
What went wrong:
Partitions and buckets are getting created in HDFS but the data is present only in the 1st bucket of all the partitions; all the remaining buckets are empty.
Please let me know what I did wrong and how to resolve this issue.
When using bucketing, Hive computes a hash of the CLUSTERED BY value (here you use id) and splits the table into that many flat files inside each partition.
Because the table is split up by the hashes of the ids, the size of each split depends on the values in your table.
If no values hash to any bucket other than the first one, all of the remaining flat files will be empty.
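One way to sanity-check this (a sketch, assuming Hive's default assignment of hash(id) modulo the bucket count) is to look at how the staging ids would be spread across the 5 buckets:
-- Rough distribution check: how many staging rows would land in each bucket.
SELECT pmod(hash(id), 5) AS bucket, count(*) AS cnt
FROM transactions_staging
GROUP BY pmod(hash(id), 5);
If everything lands in bucket 0, the data (or the chosen clustering column) is the problem rather than the table definition.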