I have created a table and tried inserting the same values multiple times to check for duplicates. I can see that the duplicates are being inserted. Is there a way to avoid duplicates in a ClickHouse table?
CREATE TABLE sample.tmp_api_logs ( id UInt32, EventDate Date)
ENGINE = MergeTree(EventDate, id, (EventDate,id), 8192);
insert into sample.tmp_api_logs values(1,'2018-11-23'),(2,'2018-11-23');
insert into sample.tmp_api_logs values(1,'2018-11-23'),(2,'2018-11-23');
select * from sample.tmp_api_logs;
/*
┌─id─┬──EventDate─┐
│ 1 │ 2018-11-23 │
│ 2 │ 2018-11-23 │
└────┴────────────┘
┌─id─┬──EventDate─┐
│ 1 │ 2018-11-23 │
│ 2 │ 2018-11-23 │
└────┴────────────┘
*/
Most likely ReplacingMergeTree is what you need, as long as duplicate records share the same primary key (sorting key). You can also try other MergeTree engine variants for different behavior when a duplicate record is encountered. The FINAL keyword can be used at query time to ensure uniqueness.
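A minimal sketch of that approach (the table name tmp_api_logs_dedup and the modern ENGINE syntax are assumptions, not from the original question):
CREATE TABLE sample.tmp_api_logs_dedup
(
    id UInt32,
    EventDate Date
)
ENGINE = ReplacingMergeTree  -- rows with an equal sorting key collapse on merge
ORDER BY (EventDate, id);

INSERT INTO sample.tmp_api_logs_dedup VALUES (1,'2018-11-23'),(2,'2018-11-23');
INSERT INTO sample.tmp_api_logs_dedup VALUES (1,'2018-11-23'),(2,'2018-11-23');

-- FINAL collapses duplicates at read time, even before background merges run
SELECT * FROM sample.tmp_api_logs_dedup FINAL;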
If the raw data does not contain duplicates and they can appear only during retries of INSERT INTO, there's a deduplication feature in ReplicatedMergeTree. To make it work, you should retry inserts of exactly the same batches of data (the same set of rows in the same order). You can use a different replica for these retries and the data block will still be inserted only once, as block hashes are shared between replicas via ZooKeeper.
Otherwise, you should deduplicate data externally before inserts to ClickHouse or clean up duplicates asynchronously with ReplacingMergeTree or ReplicatedReplacingMergeTree.
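As an illustration (assuming the table were created with a Replicated* engine rather than plain MergeTree), a retried batch is silently skipped:
INSERT INTO sample.tmp_api_logs VALUES (1,'2018-11-23'),(2,'2018-11-23');
-- network error, so the client retries the identical batch (same rows, same order):
INSERT INTO sample.tmp_api_logs VALUES (1,'2018-11-23'),(2,'2018-11-23');
-- the second block has the same hash, so it is deduplicated and only two rows exist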
Related
I'm trying to create one table from another using
CREATE TABLE IF NOT EXISTS new_data ENGINE = ReplicatedReplacingMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data', '{replica}')
ORDER BY created_at
SETTINGS index_granularity = 8192, allow_nullable_key=TRUE
AS
SELECT *
FROM table
WHERE column IS NOT NULL
When I use
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data', '{replica}'),
I get only around 7-9% of the expected number of rows, i.e. of what the SELECT...FROM...WHERE query itself returns.
When I use
ENGINE = ReplicatedMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data', '{replica}')
I get 3 times more rows than expected (I assume every row occurs exactly 3 times).
I would like to get the exact number of rows, without losses and without duplication.
ReplicatedReplacingMergeTree with ORDER BY created_at
will collapse all rows that share the same created_at value into a single row.
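A hedged sketch of a fix: extend the sorting key with a column that makes rows unique (the id column here is an assumption about the source table):
CREATE TABLE IF NOT EXISTS new_data
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data', '{replica}')
ORDER BY (created_at, id)  -- id assumed unique enough to distinguish rows
SETTINGS index_granularity = 8192, allow_nullable_key = TRUE
AS
SELECT *
FROM table
WHERE column IS NOT NULL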
How did you delete the existing table data before creating
ReplicatedMergeTree('/clickhouse/fedor/tables/{shard}/subfolder/new_data'...)?
Did you use DROP TABLE new_data SYNC?
Which engine do you have for the source table?
Can I find the sorting key (i.e. the ORDER BY key) or the primary key that was used at the time of table creation in ClickHouse?
The same goes for the table engine used at creation: how can I find it?
You can use system.tables
SELECT
sorting_key,
engine_full
FROM system.tables
WHERE (database = '<database_name>') AND (name = '<table_name>')
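system.tables also exposes the primary key and partition key, so the same query can be extended (the placeholders remain yours to fill in):
SELECT
    sorting_key,
    primary_key,
    partition_key,
    engine_full
FROM system.tables
WHERE (database = '<database_name>') AND (name = '<table_name>')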
I found it with the help of another answer.
You can use the command
SHOW CREATE TABLE db.table;
which outputs something like
│ CREATE TABLE db.table
(
`field1` Int64,
`field2` Int64,
`field3` DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(field3)
PRIMARY KEY field1
ORDER BY (field1)
SETTINGS xxxx │
which results in showing the command used at the time of table creation.
The situation is this: I insert some data into a ReplicatedMergeTree table and immediately run select count(1) from table, and I get different results on consecutive runs. As I understand it, this is caused by the replication mechanism: copying the data to the other replicas takes some time, so a query routed to a replica that has not caught up yet returns a different result.
How can I avoid this problem if I want to read the data immediately after inserting it?
For reading from a Distributed table you can play with the following settings:
insert_quorum for INSERT queries. For example, with 3 replicas you specify insert_quorum = 3, so the client waits until the data is replicated to all 3 replicas. https://clickhouse.com/docs/en/operations/settings/settings/#settings-insert_quorum
select_sequential_consistency for SELECT queries. SELECTs will then include the data written with insert_quorum. https://clickhouse.com/docs/en/operations/settings/settings/#settings-select_sequential_consistency
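A minimal sketch of the combination (db.table stands for your table; session-level SET is used for clarity, and 3 replicas are assumed):
-- writer session:
SET insert_quorum = 3;
INSERT INTO db.table VALUES (1, '2018-11-23');

-- reader session:
SET select_sequential_consistency = 1;
SELECT count(1) FROM db.table;  -- sees all quorum-acknowledged inserts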
We have a ClickHouse table with the CollapsingMergeTree engine. We want to update records as and when data is imported from the source (which may be frequent). Initially all records are inserted with +1 in the sign column. To update a record, we insert the record to be updated with the same values and a -1 sign, then insert the updated record with a +1 sign, expecting that the same records with opposite signs will be collapsed by ClickHouse when data parts are merged in the background.
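For illustration, the update pattern looks like this (the table and column values are made up):
INSERT INTO events VALUES (42, 'old_value', 1);   -- original row, sign = +1
-- later, to update row 42: cancel the old state, then insert the new state
INSERT INTO events VALUES (42, 'old_value', -1);  -- same values, sign = -1
INSERT INTO events VALUES (42, 'new_value', 1);   -- updated row, sign = +1
-- the (+1, -1) pair for 'old_value' should collapse when parts merge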
The problem is 'It never happens'
I am aware that ClickHouse merges data asynchronously, but it has been months and no merge has been performed by ClickHouse.
I queried SELECT * FROM system.merges to check whether any merges were in progress; the query returned 0 rows. I also updated ClickHouse to its latest version. But no luck!
I would appreciate your help if anyone can point out the issue. Am I missing any server-level settings? When does ClickHouse merge such records?
Or is there any other approach I should take to update ClickHouse data?
eventually -- could be never.
You should not rely on the merge process. It has its own complicated algorithm to balance the number of parts.
Merging has no goal of doing a final merge down to a single part, because that is not efficient.
optimize forces an unplanned merge; you'll get, for example, 4 parts from 22.
optimize final forces unplanned merges of all parts until only one part remains per partition.
The only problem with final is that it rewrites even a single part into a new single part, because sometimes rows need to be collapsed even within one part.
So for some tables we run optimize table x partition final (via cron) for partitions which have more than one part.
select concat('optimize table ', database, '.', '`', table, '` partition ', ((groupUniqArray(partition)) as partition_list)[1], ' final')
from system.parts
where (database = 'xxx' or database like 'zzzz\_%')
  and (database, table) in (select database, table from system.replicas where engine = 'ReplicatedReplacingMergeTree' and is_leader)
  and table not like '.inner%'
group by database, table
having length(partition_list) > 1 and sum(rows) < 5000000
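Each row this returns is a ready-to-run statement, e.g. (table name and partition value are illustrative):
optimize table xxx.`events` partition 201811 final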
https://stackoverflow.com/a/60154073/11644308
I have a bucketed hive table. It has 4 buckets.
CREATE TABLE user(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
CLUSTERED BY(user_id) INTO 4 BUCKETS;
Initially I inserted some records into this table using the following query.
set hive.enforce.bucketing = true;
insert into user
select * from second_user;
After this operation, in HDFS I see that 4 files were created under this table's directory.
Later I needed to insert another set of data into the user table, so I ran the query below.
set hive.enforce.bucketing = true;
insert into user
select * from third_user;
Now another 4 files are created under the user directory, so it has 8 files in total.
Is it fine to do multiple inserts into a bucketed table like this?
Does it affect the bucketing of the table?
I figured it out!!
Actually, if you do multiple inserts on a bucketed Hive table, Hive won't complain as such.
All Hive queries will work fine.
Having said that, such an operation spoils the bucketing concept of the table: after multiple inserts into a bucketed table, sampling fails.
TABLESAMPLE doesn't work properly after multiple inserts.
Even the sort merge bucket map join doesn't work after such an operation.
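For example, a bucket sample like the following (illustrative query) assumes each bucket maps to exactly one file, which no longer holds once every insert adds 4 more files:
SELECT * FROM user TABLESAMPLE(BUCKET 1 OUT OF 4 ON user_id);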
I don't think that should be an issue, because you have declared that you want bucketing on user_id, so every time you insert, it will create 4 more files.
Bucketing is used for faster query processing, so if it makes 4 more files every time, it will make your query processing even faster.