How many threads does ClickHouse use when inserting data into a distributed table in synchronous mode? - insert

I want to figure out how many threads ClickHouse uses when it inserts data into a distributed table. Are there any configuration settings that control this? I have a cluster of 3 servers, a local table (engine: ReplicatedMergeTree) with 3 shards and 2 replicas, and a distributed table that points to this local table. I want to insert data into the distributed table in synchronous mode (internal_replication=true, insert_distributed_sync=true, insert_quorum=2).
I've read the ClickHouse documentation and learned that background_pool_size sets the number of merge threads, and that async_insert_threads applies to async mode rather than sync mode, which is my case. max_threads is used for queries, not inserts. max_insert_threads is used for INSERT SELECT, not plain INSERT.

Related

How to create Replicated table for Join Engine

The cluster has replicated shards, and because I can't create a 'ReplicatedJoin' engine table, I created a Distributed engine table (join_dist) over the Join engine local table (join_local). After inserting data into the local table through a proxy, I run select count(1) from join_dist and find that the result is approximately half of the actual value. I think this query only collects the results from half of the shards in the cluster. How can I solve this?
You can use the multiplexing feature of the Distributed table.
First, create an additional cluster in remote_servers where all ClickHouse nodes are replicas in a single shard with internal_replication = false.
Then create a Distributed table that uses the new cluster.
Then insert data into the Distributed table. The Distributed table multiplexes the inserts and writes exactly the same data to all replicas (to all engine=Join tables).
Then use select count(1) from join. You don't need select count(1) from _dist, because all Join tables hold the same data.
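The fan-out described above can be pictured with a small toy model. This is not ClickHouse code: plain Python lists stand in for the engine=Join tables, and the function name insert_via_distributed and the three-node layout are assumptions made purely for illustration.

```python
# Toy model only: each list stands in for the engine=Join table on one node.
def insert_via_distributed(replicas, rows, internal_replication=False):
    """Mimic how a Distributed table fans an INSERT out within one shard."""
    if internal_replication:
        # The Distributed table writes to a single replica and relies on the
        # underlying engine to replicate (Join tables cannot do this).
        # Picking the first replica is a simplification of this model.
        replicas[0].extend(rows)
    else:
        # The Distributed table itself writes the same rows to every replica.
        for table in replicas:
            table.extend(rows)


# Three hypothetical nodes, all listed as replicas of one shard.
nodes = [[], [], []]
insert_via_distributed(nodes, ["a", "b", "c", "d"])
counts = [len(table) for table in nodes]
```

In this model every node ends up holding all four rows, which is why a count against any single local Join table already gives the full total.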

After an insert, an immediate select count(1) from a ReplicatedMergeTree engine table returns different results each time

The situation is this: I insert some data into a ReplicatedMergeTree engine table, immediately run select count(1) from table, and get different results. As far as I know, this is caused by the replication mechanism. Copying data to a replica takes some time, so if the query is routed to a replica that has not caught up yet, it returns a different result.
How can I avoid this problem if I want to use the data immediately after inserting it?
When reading from a Distributed table, you can try the following settings:
insert_quorum. Setting for INSERT queries. For example, with 3 replicas you specify insert_quorum = 3, so the client waits until the data is replicated across all 3 replicas. https://clickhouse.com/docs/en/operations/settings/settings/#settings-insert_quorum
select_sequential_consistency. Setting for SELECT queries. The SELECT will include the data written with insert_quorum. https://clickhouse.com/docs/en/operations/settings/settings/#settings-select_sequential_consistency

How to improve single insert performance in Oracle

In my business case, I need to insert one row at a time and can't use batch inserts, so I want to know what throughput Oracle can achieve. I tried these approaches:
Effective
I use multiple threads, each with its own connection, to insert data
I use an SSD to store the Oracle datafile
Ineffective
I use multiple tables to store data in one schema
I use table partitioning
I use multiple schemas to store data
Increase the datafile block size
Use the append hint in the insert SQL
In the end, the best throughput was over 10,000 TPS (1w/s+)
Other:
Oracle 11g
Single insert data size 1k
CPU i7, 64GB memory
Oracle is highly optimized for anything from one row inserts to batches of hundreds of rows. You do not mention whether you are having performance problems with this one row insert nor how long the insert takes. For such a simple operation, you don't need to worry about any of those details. If you have thousands of web-based users inserting one row into a table every minute, no problem. If you are committing your work at the appropriate time, and you don't have a huge number of indexes, a single row insert should not take more than a few milliseconds.
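As a rough back-of-the-envelope check of what "a few milliseconds" per insert implies, here is a minimal sketch. The 2 ms latency and the 20 connections are assumed numbers, not measurements, and the formula ignores contention, so it is an upper bound at best.

```python
def max_inserts_per_second(connections, latency_ms):
    """Upper bound on throughput when each connection issues one
    insert-and-commit at a time and connections never block each other."""
    per_connection = 1000.0 / latency_ms  # inserts per second on one connection
    return connections * per_connection


# Assumed figures, for illustration only.
single = max_inserts_per_second(1, 2.0)
pooled = max_inserts_per_second(20, 2.0)
```

Under these assumptions a single connection tops out at 500 inserts per second, so higher throughput has to come from more concurrent connections, which is consistent with the question's observation that multi-threading helped.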
In SQL*Plus try the commands
set autotrace on explain statistics
set timing on
and run your insert statement.
Edit your question to include the results of the explain plan. And be sure to indent the results 4 spaces.

How does loading distributed data in Hive work?

My goal is to perform a SELECT query using Hive.
When I have a small amount of data on a single machine (namenode), I start by:
1-Creating a table to hold the data: create table table1 (col1 int, col2 string)
2-Loading the data from a file path: load data local inpath 'path' into table table1;
3-Performing my SELECT query: select * from table1 where col1>0
Now I have a huge data set of 10 million rows that doesn't fit on a single machine. Let's assume Hadoop divided my data across, for example, 10 datanodes, and each datanode contains 1 million rows.
Retrieving the data to a single computer is impossible due to its size, or would take a lot of time if it were possible.
Will Hive create a table at each datanode and perform the SELECT query there,
or will Hive move all the data to one location (datanode) and create one table (which is inefficient)?
Ok, so I will walk through what happens when you load data into Hive.
The 10 million line file will be cut into 64MB/128MB blocks.
Hadoop, not Hive, will distribute the blocks to the different slave nodes on the cluster.
These blocks will be replicated several times. Default is 3.
Each slave node will contain different blocks that make up the original file, but no machine will contain every block. However, since Hadoop replicates the blocks there must be at least enough empty space on the cluster to accommodate 3x the file size.
When the data is in the cluster Hive will project the table onto the data. The query will be run on the machines Hadoop chooses to work on the blocks that make up the file.
10 million rows isn't that large though. Unless the table has 100 columns you should be fine in any case. However, if you were to do a select * in your query just remember that all that data needs to be sent to the machine that ran the query. That could take a long time depending on file size.
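To put numbers on the block splitting and replication described above, here is a minimal sketch. The row size of about 100 bytes is an assumption (the question does not give one), and the helper names are hypothetical.

```python
def block_count(file_bytes, block_bytes):
    """Number of HDFS blocks needed for a file (ceiling division)."""
    return -(-file_bytes // block_bytes)


def raw_storage_bytes(file_bytes, replication=3):
    """Cluster space consumed once every block is replicated."""
    return file_bytes * replication


MIB = 1024 * 1024
# Assumption for illustration: 10 million rows at about 100 bytes each.
file_bytes = 10_000_000 * 100

blocks_128 = block_count(file_bytes, 128 * MIB)  # 128 MB block size
blocks_64 = block_count(file_bytes, 64 * MIB)    # 64 MB block size
stored = raw_storage_bytes(file_bytes)           # default replication of 3
```

With these assumed sizes the file occupies only 8 blocks at 128 MB (15 at 64 MB) and about 3 GB of raw cluster space.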
I hope I covered your question. If not please let me know and I'll try to help further.
The query
select * from table1 where col1>0
is just a map-side job, so each data block is processed locally on the node that holds it. There is no need to collect the data centrally.

Oracle SQL*Loader parallel mode

When we talk about parallel mode with SQL*Loader, what does that actually mean? Say my script executes:
sqlldr control=first.ctl parallel=true direct=true data=first.unl
sqlldr control=second.ctl parallel=true direct=true data=second.unl
I am inserting into 2 tables, using first.unl as the data file for the first table and second.unl for the second table.
With parallel=true and direct=true, will this run the 2 instances of SQL*Loader for first.unl and second.unl in parallel, or will it run the first instance and do multiple inserts based on first.unl, and then run the second instance and do multiple inserts based on second.unl?
From the documentation
"PARALLEL specifies whether direct loads can operate in multiple concurrent sessions to load data into the same table."
So, one instance of SQL*Loader uses multiple sessions to insert into one table. The actual degree of parallelism is governed by the usual parallelization parameters.
"So i cannot have inserting into multiple tables in parallel?"
If you kick off two SQL*Loader instances, they will run simultaneously. You need to be careful that you have enough CPU to handle the number of threads you're spawning.
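One way to kick off both loads at the same time is to start them as separate processes and wait for both. The following is only a sketch: it assumes sqlldr is on the PATH, reuses the command lines from the question as they are (without connection credentials), and run_concurrently is a hypothetical helper name.

```python
import subprocess


def run_concurrently(commands):
    """Start every command at once, then wait for all of them.

    Returns the exit codes in the same order as the commands."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


# Command lines copied from the question; credentials are omitted there too.
SQLLDR_COMMANDS = [
    ["sqlldr", "control=first.ctl", "parallel=true", "direct=true", "data=first.unl"],
    ["sqlldr", "control=second.ctl", "parallel=true", "direct=true", "data=second.unl"],
]
```

Calling run_concurrently(SQLLDR_COMMANDS) would start both loads together; a non-zero entry in the returned list marks a load that failed.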

Resources