Ideal shard sizing for ClickHouse

I have to set up a multi-node ClickHouse cluster. I have set up 3 machines for ZooKeeper, and I now have 2 ClickHouse servers with 14 TB of storage each. My question is: how do I decide on the number of shards? Previously I used a single shard, but now I don't understand what the best practice is for my infrastructure.
I have used the following configuration before, and it worked for me. Now, with more storage (14 TB each), I don't know what the best sizing is for me.
<remote_servers>
    <click_cluster>
        <shard>
            <replica>
                <host>10.10.1.114</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>10.10.1.115</host>
                <port>9000</port>
            </replica>
        </shard>
    </click_cluster>
</remote_servers>

<macros>
    <shard>shard-01</shard>
    <replica>replica-01</replica>
</macros>
How should I size my ClickHouse cluster? And what if, in the future, I have 2 more servers with the same specifications to add?
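For illustration only: if 2 more servers with the same specs are added later, one common layout is the usual N shards * M replicas pattern, here 2 shards * 2 replicas. A minimal sketch of how the existing remote_servers block could be extended (the 10.10.1.116/.117 hosts are hypothetical, and internal_replication assumes the local tables are ReplicatedMergeTree):

<remote_servers>
    <click_cluster>
        <shard>
            <!-- Replication within this shard is handled by ReplicatedMergeTree itself -->
            <internal_replication>true</internal_replication>
            <replica><host>10.10.1.114</host><port>9000</port></replica>
            <replica><host>10.10.1.115</host><port>9000</port></replica>
        </shard>
        <shard>
            <internal_replication>true</internal_replication>
            <!-- Hypothetical new servers -->
            <replica><host>10.10.1.116</host><port>9000</port></replica>
            <replica><host>10.10.1.117</host><port>9000</port></replica>
        </shard>
    </click_cluster>
</remote_servers>

Each server would also get its own macros values, e.g. shard-02/replica-01 and shard-02/replica-02 on the two new hosts.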

Related

count() query on ClickHouse replication

Hello ClickHouse experts,
I am testing ClickHouse replication without ZooKeeper to understand how it works, and I have 2 questions.
How I tested:
Set up 3 clickhouse-server instances (CH1, CH2, CH3) on different VMs (21.9.4 revision 54449)
Created a cluster using these 3 servers (see below for the config)
Created a MergeTree table (log_local) and a Distributed table (log_all)
Sent 100M logs to one server (CH1) through log_all, using clickhouse-client running on a different VM
(Q1) After the insertion, I query count() of log_local on all 3 servers and the total is as expected (i.e. 200M). However, when I query through log_all, the results differ between servers (close to 200M but not exact). Even stranger, the count changes between runs on the same server. Can you please explain this behavior? Could it be a configuration issue? With the no-replica (3shards_1replica) test, I don't see this count difference.
I see this is not recommended, so eventually I'd use a cluster coordinator - hoping clickhouse-keeper is production-ready by then. Before that stage, I am assessing whether I can use this as a temporary solution with explainable shortcomings.
(Q2) This is a more generic question on replication. The count from log_all is 200M, which includes the replicas. What is the practical way to query it without counting replicas? I.e., select count() from log_all (or a differently named table) should yield 100M, not 200M.
Configs (I have modified some names from the original to avoid exposing private information):
# remote_servers
<log_3shards_2replicas>
    <shard>
        <replica>
            <host>CH1</host>
            <port>9000</port>
        </replica>
        <replica>
            <host>CH2</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <replica>
            <host>CH2</host>
            <port>9000</port>
        </replica>
        <replica>
            <host>CH3</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <replica>
            <host>CH3</host>
            <port>9000</port>
        </replica>
        <replica>
            <host>CH1</host>
            <port>9000</port>
        </replica>
    </shard>
</log_3shards_2replicas>
# log_local
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(my_time)
ORDER BY my_time
SETTINGS index_granularity = 8192

# log_all
ENGINE = Distributed('log_3shards_2replicas', 'my_db', 'log_local', rand())
Some references:
https://github.com/ClickHouse/ClickHouse/issues/2161
fillimonov commented - “Actually you can create replication without Zookeeper and ReplicatedMergeTree, just by using Distributed table above MergeTree and internal_replication=false cluster setting, but in that case there will no guarantee that all the replicas will have 100% the same data, so i rather would not recommend that scenario.”
Similar issue discussions:
https://github.com/ClickHouse/ClickHouse/issues/1443
https://github.com/ClickHouse/ClickHouse/issues/6735
Thanks in advance.
You have configured "circle-replication". God help you.
THIS CONFIGURATION IS NOT SUPPORTED and is not covered by tests in CI.
Circle-replication is hard to configure and very unobvious for newbies.
Circle-replication causes a lot of issues and is hard to debug.
A lot of queries yield incorrect results.
Most users who used circle-replication in the past have moved to the usual setup of N shards * M replicas and are happy now.
https://kb.altinity.com/engines/
https://youtu.be/4DlQ6sVKQaA
In your configuration the DEFAULT_DATABASE attribute is missing; circle-replication cannot work without it.
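For reference, a rough sketch of that usual N shards * M replicas setup, assuming a non-overlapping cluster definition (called log_cluster here, a hypothetical name), per-node {shard}/{replica} macros, and a ZooKeeper/Keeper coordinator:

-- Replicated local table: replication is handled by ClickHouse itself rather than by the Distributed table
CREATE TABLE my_db.log_local ON CLUSTER log_cluster
(
    my_time DateTime
    -- ... remaining columns ...
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/log_local', '{replica}')
PARTITION BY toYYYYMMDD(my_time)
ORDER BY my_time;

-- Distributed table over the local table: a query reads one replica per shard,
-- so count() is not inflated by replicas
CREATE TABLE my_db.log_all ON CLUSTER log_cluster AS my_db.log_local
ENGINE = Distributed('log_cluster', 'my_db', 'log_local', rand());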

Collecting performance metrics for Kafka Connect using the HDFS sink connector

We are running Kafka Connect to read Avro data from a Kafka topic and store it at an HDFS location using the HDFS sink connector.
It basically writes 100 records (configured through flush.size) to a single Avro file at the HDFS location. I want to know how to calculate performance metrics for it, such as the number of records or messages written to HDFS per second, bytes written per second, throughput, latency, etc. We do not have Kafka Control Center configured, nor is a JMX port enabled through which it could be monitored (e.g. with JConsole).
Can this be calculated manually, and if so, how?
Is there any other tool I can use, like JMeter? If yes, how?
I want to calculate application-specific metrics, not system-level ones.
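For context, a rough sketch of the kind of sink configuration being described, using standard Confluent HDFS sink connector property names (the topic name and HDFS URL are placeholders):

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-avro-topic
hdfs.url=hdfs://namenode:8020
# flush.size controls how many records go into each output file in HDFS
flush.size=100
format.class=io.confluent.connect.hdfs.avro.AvroFormat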

Greenplum download dump to local cluster in parallel

Is there a more effective way to fetch Greenplum's whole dump than doing it through multiple JDBC connections to the master node?
I need to download the whole Greenplum dump through JDBC. To do the job quicker, I am going to use Spark parallelism (fetching data in parallel through multiple JDBC connections). As I understand it, I will have multiple JDBC connections to Greenplum's single master node. I am going to store the data in HDFS in Parquet format.
For parallel exporting, you can try a gphdfs writable external table.
GPDB segments can write to and read from external sources in parallel.
http://gpdb.docs.pivotal.io/4340/admin_guide/load/topics/g-gphdfs.html
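A rough sketch of what that can look like (table, column and HDFS path names are placeholders), based on the gphdfs protocol documented at the link above:

-- Writable external table: each Greenplum segment writes its share of the rows to HDFS in parallel
CREATE WRITABLE EXTERNAL TABLE my_table_ext (LIKE my_table)
LOCATION ('gphdfs://namenode:8020/data/my_table')
FORMAT 'TEXT' (DELIMITER '|');

-- Export the data
INSERT INTO my_table_ext SELECT * FROM my_table;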
Now you can use the Greenplum-Spark connector to parallelize data transfer between Greenplum segments and Spark executors.
The Greenplum-Spark connector speeds up data transfer because it leverages parallel processing in Greenplum segments and Spark workers. It is definitely faster than the JDBC approach, which transfers data via the Greenplum master node.
Reference:
http://greenplum-spark.docs.pivotal.io/100/index.html

RethinkDB cross-cluster replication

I have 3 different pools of clients in 3 different geographical locations.
I need to configure RethinkDB with 3 different clusters and replicate data between them (inserts, updates and deletes). I do not want to use sharding, only replication.
I couldn't find in the documentation whether this is possible, or how to configure multi-cluster replication.
Any help is appreciated.
I think a multi-cluster setup is just the same as a single cluster with nodes in different data centers.
First, you need to set up a cluster; follow this document: http://www.rethinkdb.com/docs/start-a-server/#a-rethinkdb-cluster-using-multiple-machines
Basically, use the command below to join a node into the cluster:
rethinkdb --join IP_OF_FIRST_MACHINE:29015 --bind all
Once you have your cluster set up, the rest is easy. Go to your admin UI, select the table, and under "Sharding and replication" click Reconfigure and enter how many replicas you want; just keep shards at 1.
You can also read more about Sharding and Replication at http://rethinkdb.com/docs/sharding-and-replication/#sharding-and-replication-via-the-web-console
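The same reconfiguration can also be done from ReQL instead of the web UI; a small sketch using the Python driver (the table name and connection details are placeholders):

import rethinkdb as r

# Connect to any node of the cluster
conn = r.connect(host="IP_OF_FIRST_MACHINE", port=28015)

# Keep a single shard but ask for 3 replicas (e.g. one per location's node)
r.table("my_table").reconfigure(shards=1, replicas=3).run(conn)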

HBase: difference between RegionServer and QuorumPeer

I am new to HBase, and I have a simple question: what is the difference between a regionserver and a quorumpeer? The list of regionservers is in the regionservers file, while the quorum peers should be configured in hbase-site.xml. I guessed that regions of an HBase table can only be stored on region servers, but I have no idea about quorum peers. Should every node of an HBase cluster be a regionserver and a quorumpeer at the same time? If you know, please explain it to me. Thanks!
For HBase to work it needs ZooKeeper so that the RegionServers and the HMaster can communicate and transfer data. Check this out: http://hbase.apache.org/book/zookeeper.html
You need to have a quorum of ZooKeeper servers running (generally 3 or 5).
You have to list the nodes where the ZooKeeper servers are running in the hbase.zookeeper.quorum property in hbase-site.xml.
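For example, a minimal hbase-site.xml fragment could look like the following (the host names are placeholders for wherever the ZooKeeper quorum actually runs):

<configuration>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
    </property>
    <property>
        <!-- Port the ZooKeeper servers listen on; 2181 is the default -->
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
</configuration>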
