There are replicated shards on the cluster. Because I can't create a 'ReplicatedJoin' engine table, I created a Distributed engine table (join_dist) over the Join engine local table (join_local). After inserting data into the local table through a proxy, I ran select count(1) from join_dist and found that the result is approximately half of the actual value. I think this query only collects results from half of the cluster. How can I solve this?
You can use the multiplexing feature of the Distributed table.
First, define an additional cluster in remote_servers where all ClickHouse nodes are replicas in a single shard, with internal_replication = false.
Then create a Distributed table that uses that new cluster.
Then insert data into that Distributed table. It multiplexes the inserts and writes exactly the same data to all replicas (to all engine=Join tables).
Then query the local Join table directly (select count(1) from join_local). You don't need select count(1) from join_dist, because all Join tables hold the same data.
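A minimal sketch of the SQL side of those steps. It assumes the extra cluster is called all_replicas (a hypothetical name; it still has to be defined in remote_servers as one shard containing every node, with internal_replication = false), that the multiplexing table is called join_mux (also hypothetical), and that join_local has two made-up columns, since the real schema is not shown in the question:

```sql
-- Hypothetical schema; the real join_local from the question is not shown.
-- Create this on every node.
CREATE TABLE IF NOT EXISTS default.join_local
(
    id   UInt64,
    name String
)
ENGINE = Join(ANY, LEFT, id);

-- Distributed table over the single-shard, all-replicas cluster.
-- No sharding key is needed because there is only one shard.
CREATE TABLE IF NOT EXISTS default.join_mux AS default.join_local
ENGINE = Distributed(all_replicas, default, join_local);

-- With internal_replication = false the Distributed table itself
-- writes each inserted block to every replica of the shard.
INSERT INTO default.join_mux VALUES (1, 'a'), (2, 'b');

-- Every node should now hold the full data set, so read the local table.
SELECT count() FROM default.join_local;
```

Inserts through a Distributed table are asynchronous by default, so the local count can lag for a moment unless insert_distributed_sync is enabled.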
Related
I want to figure out how many threads ClickHouse uses when it inserts data into a distributed table. Are there any configurations or settings that control this? I have a cluster of 3 servers, a local table (engine: ReplicatedMergeTree) with 3 shards and 2 replicas, and a distributed table that points to this local table. I want to insert data into this distributed table in synchronous mode (internal_replication=true, insert_distributed_sync=true, insert_quorum=2).
I've read the ClickHouse documentation and learned that background_pool_size sets the number of merge threads, and that async_insert_threads applies to async mode rather than sync mode, which is my case. max_threads applies to queries, not inserts. max_insert_threads applies to INSERT SELECT, not plain inserts.
I think the answer is no, but wanted to check here to be sure.
Is it possible to move data (copy and then delete the source) between two distributed tables in ClickHouse?
Say, I have local tables a and b defined in all of my nodes, and a_dist defined as:
CREATE TABLE IF NOT EXISTS a_dist ON CLUSTER my_cluster_name AS a ENGINE = Distributed(my_cluster_name, default, a, rand())
CREATE TABLE IF NOT EXISTS b_dist ON CLUSTER my_cluster_name AS a ENGINE = Distributed(my_cluster_name, default, b, rand())
Is it possible to move all of the data from a_dist to b_dist directly? Or should I move data in each node from table a to table b?
Thanks!
The simplest way
INSERT INTO b_dist SELECT * FROM a_dist;
TRUNCATE TABLE default.a ON CLUSTER 'my_cluster_name';
But it will produce a lot of data transfer between the node where you execute the query and the other nodes in the cluster.
With the Atomic database engine and ClickHouse 21.8+, running the following directly on each node will be much faster:
EXCHANGE TABLES default.a AND default.b
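Note that EXCHANGE TABLES swaps the two tables rather than copying rows, so it only behaves like a move if b is empty beforehand. A sketch of how one might check that on each node, using the table names from the question:

```sql
-- Run on each node before the swap: b should be empty for a pure "move".
SELECT count() FROM default.b;

-- Atomically swap the two local tables (requires the Atomic database engine).
EXCHANGE TABLES default.a AND default.b;

-- After the swap, a holds whatever b held before (nothing, if b was empty).
SELECT count() FROM default.a;
```

Because a_dist and b_dist refer to the local tables by name, they pick up the swapped data without being recreated.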
Here is the situation: I insert some data into a ReplicatedMergeTree table and immediately run select count(1) from table, and I get different results. As far as I know, this is caused by the replication mechanism: it takes some time for the replicas to copy the data, so if the query is routed to a replica that has not caught up yet, it returns a different result.
How can I avoid this problem if I want to use the data immediately after inserting it?
For reading from a Distributed table you can experiment with the following settings:
insert_quorum: setting for INSERT queries. For example, with 3 replicas you specify insert_quorum = 3, so the client waits until the data is replicated across all 3 replicas. https://clickhouse.com/docs/en/operations/settings/settings/#settings-insert_quorum
select_sequential_consistency: setting for SELECT queries. The SELECT will include the data written with insert_quorum. https://clickhouse.com/docs/en/operations/settings/settings/#settings-select_sequential_consistency
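A minimal sketch of how the two settings could be combined at session level. The table name events_local and its two columns are hypothetical; it is assumed to be a ReplicatedMergeTree table with 3 replicas:

```sql
-- Writer session: the INSERT returns only after 3 replicas have the data.
SET insert_quorum = 3;
INSERT INTO default.events_local VALUES (1, 'a');

-- Reader session: the SELECT only sees data written with quorum,
-- and is only allowed on a replica that has all such data.
SET select_sequential_consistency = 1;
SELECT count() FROM default.events_local;
```

The trade-off is latency: the insert now waits for the slowest of the required replicas.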
I'm using Oracle 11g. I have a query that joins a local table with remote tables using db links. I want the driving table to be the remote table, because I filter primarily on the remote table to get a few rows. I then want to join those rows with the local table.
The problem is that the optimizer ignores the ORDERED and INDEX hints and does a full table scan of the local table. I am using the right indexes and have generated statistics. When I run the queries individually against each table, they use the correct indexes, but with the join, the local table always gets a full table scan and acts as the driving table.
SELECT /*+ INDEX_RS_ASC(l) */
*
FROM remote_table@mylink r
JOIN local_table l USING (cont_id)
WHERE r.PRIME_VENDOR_ID = '12345'
My goal is to perform a SELECT query using Hive.
When I have a small amount of data on a single machine (namenode), I start by:
1-Creating a table that contains this data: create table table1 (col1 int, col2 string)
2-Loading the data from a file path: load data local inpath 'path' into table table1;
3-Perform my SELECT query: select * from table1 where col1>0
Now I have a huge data set of 10 million rows that doesn't fit on a single machine. Let's assume Hadoop divided my data across, for example, 10 datanodes, and each datanode contains 1 million rows.
Retrieving the data to a single computer is impossible due to its size, or would take a lot of time if it were possible.
Will Hive create a table at each datanode and perform the SELECT query there,
or will Hive move all the data to one location (datanode) and create one table? (which is inefficient)
Ok, so I will walk through what happens when you load data into Hive.
The 10 million line file will be cut into 64MB/128MB blocks.
Hadoop, not Hive, will distribute the blocks to the different slave nodes on the cluster.
These blocks will be replicated several times. Default is 3.
Each slave node will contain different blocks that make up the original file, but no machine will contain every block. However, since Hadoop replicates the blocks there must be at least enough empty space on the cluster to accommodate 3x the file size.
When the data is in the cluster Hive will project the table onto the data. The query will be run on the machines Hadoop chooses to work on the blocks that make up the file.
10 million rows isn't that large though. Unless the table has 100 columns you should be fine in any case. However, if you were to do a select * in your query just remember that all that data needs to be sent to the machine that ran the query. That could take a long time depending on file size.
I hope I covered your question. If not please let me know and I'll try to help further.
The query
select * from table1 where col1>0
is just a map-side job, so each data block is processed locally on the node that holds it. There is no need to collect the data centrally.
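One way to check this on your own cluster is to ask Hive for the plan instead of running the query. A small sketch using the table from the question. The exact output depends on the Hive version and configuration, and with settings such as hive.fetch.task.conversion a simple filter like this may even run as a plain fetch task rather than a MapReduce job:

```sql
-- Show the execution plan without running the query.
EXPLAIN
SELECT * FROM table1 WHERE col1 > 0;
```

If the plan shows no reduce stage, the filter is applied where the blocks are read and only the matching rows are sent back to the client.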