count() query on ClickHouse replication - clickhouse

Hello ClickHouse experts,
I am testing ClickHouse replication without ZooKeeper to understand how it works, and I have 2 questions.
How I tested:
Set up 3 clickhouse-server instances (CH1, CH2, CH3) on different VMs (21.9.4, revision 54449)
Have a cluster using these 3 servers (see below for config)
Have a MergeTree table (log_local) and a Distributed table (log_all)
Send 100M log rows to one server (CH1) through log_all, using clickhouse-client running on a separate VM
(Q1) After the insertion, I query count() of log_local on all 3 servers and the total is as expected (i.e. 200M). However, when I query through log_all, the results differ between servers (close to 200M but not exact), and, even stranger, the count changes between runs on the same server. Can you please explain this behavior? Could it be a configuration issue? With a no-replica test (3shards_1replica), I don't see this count difference.
I see this is not a recommended setup, so eventually I'd use a cluster coordinator (hoping clickhouse-keeper is production-ready by then). Before that stage, I am assessing whether I can use this as a temporary solution with explainable shortcomings.
(Q2) This is a more generic question on replication. The count from log_all is 200M, which includes the replicas. What is the practical way to query it without the replicas, i.e. so that select count() from log_all (or a differently named table) yields 100M, not 200M?
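For reference, the checks I am running look roughly like this (a sketch only: the clusterAllReplicas()/cluster() forms are just how I sanity-check per-host and per-shard totals, using the names from my config below):

-- On each server individually:
SELECT count() FROM my_db.log_local;
SELECT count() FROM my_db.log_all;

-- Per-host breakdown in one query (reads every replica):
SELECT hostName() AS host, count() AS rows
FROM clusterAllReplicas('log_3shards_2replicas', 'my_db', 'log_local')
GROUP BY host
ORDER BY host;

-- For Q2, what I am after is effectively "count one copy of the data only",
-- e.g. one replica per shard:
SELECT count() FROM cluster('log_3shards_2replicas', 'my_db', 'log_local');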
Configs (I have changed some names from the originals so as not to show private information):
# remote_servers
<log_3shards_2replicas>
    <shard>
        <replica>
            <host>CH1</host>
            <port>9000</port>
        </replica>
        <replica>
            <host>CH2</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <replica>
            <host>CH2</host>
            <port>9000</port>
        </replica>
        <replica>
            <host>CH3</host>
            <port>9000</port>
        </replica>
    </shard>
    <shard>
        <replica>
            <host>CH3</host>
            <port>9000</port>
        </replica>
        <replica>
            <host>CH1</host>
            <port>9000</port>
        </replica>
    </shard>
</log_3shards_2replicas>
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(my_time)
ORDER BY my_time
SETTINGS index_granularity = 8192
ENGINE = Distributed('log_3shards_2replicas', 'my_db', 'log_local', rand())
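(Spelled out, the two tables are roughly as follows; the column list is trimmed to my_time plus one illustrative payload column:)

CREATE TABLE my_db.log_local
(
    my_time DateTime,
    message String  -- placeholder for the real payload columns
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(my_time)
ORDER BY my_time
SETTINGS index_granularity = 8192;

CREATE TABLE my_db.log_all AS my_db.log_local
ENGINE = Distributed('log_3shards_2replicas', 'my_db', 'log_local', rand());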
Some references:
https://github.com/ClickHouse/ClickHouse/issues/2161
fillimonov commented - “Actually you can create replication without Zookeeper and ReplicatedMergeTree, just by using Distributed table above MergeTree and internal_replication=false cluster setting, but in that case there will no guarantee that all the replicas will have 100% the same data, so i rather would not recommend that scenario.”
Similar issue discussions:
https://github.com/ClickHouse/ClickHouse/issues/1443
https://github.com/ClickHouse/ClickHouse/issues/6735
Thanks in advance.

You have configured a "circle-replication". God help you.
THIS CONFIGURATION IS NOT SUPPORTED and is not covered by tests in CI.
Circle-replication is hard to configure and very unobvious for newbies.
Circle-replication causes a lot of issues and is hard to debug.
A lot of queries yield incorrect results.
Most users who used circle-replication in the past have moved to the usual N shards * M replicas setup and are happy now.
https://kb.altinity.com/engines/
https://youtu.be/4DlQ6sVKQaA
In your configuration the DEFAULT_DATABASE attribute is missing; circle-replication is unable to work without it.
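The usual setup looks roughly like this (a sketch only, run on every server: the column list and the cluster name log_cluster are illustrative, {shard}/{replica} come from per-server macros, and ReplicatedMergeTree needs ZooKeeper or clickhouse-keeper):

CREATE TABLE my_db.log_local
(
    my_time DateTime,
    message String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/log_local', '{replica}')
PARTITION BY toYYYYMMDD(my_time)
ORDER BY my_time;

CREATE TABLE my_db.log_all AS my_db.log_local
ENGINE = Distributed('log_cluster', 'my_db', 'log_local', rand());

With internal_replication=true in the cluster config, the Distributed table writes each block to one replica per shard and ReplicatedMergeTree takes care of copying it to the others.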

Related

Ideal Shard sizing for Clickhouse

I have to set up a multi-node ClickHouse cluster. I have set up 3 machines for ZooKeeper. Now I have 2 ClickHouse servers with 14 TB of storage each. My question is, how do I decide the number of shards? Previously I used a single shard, but now I don't understand what the best practice is for my infrastructure.
I have used the following configuration before, which worked for me. Now, with more storage (14 TB each), I don't know what the best sizing is for me.
<remote_servers>
    <click_cluster>
        <shard>
            <replica>
                <host>10.10.1.114</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>10.10.1.115</host>
                <port>9000</port>
            </replica>
        </shard>
    </click_cluster>
</remote_servers>
<macros>
    <shard>shard-01</shard>
    <replica>replica-01</replica>
</macros>
How should I size my ClickHouse cluster? What if, in the future, I have 2 more servers with the same specifications to add?
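For context, this is roughly how I measure what the current single shard holds (a sketch using the system tables; formatReadableSize is only there for display):

-- On each node: size on disk per table.
SELECT database, table, formatReadableSize(sum(bytes_on_disk)) AS on_disk
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;

-- The cluster layout as ClickHouse sees it.
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'click_cluster';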

Collecting Performance metrics for kafka connect using hdfs sink connector

We are running Kafka Connect to get Avro data from a Kafka topic and store it at an HDFS location using the HDFS sink connector.
It basically stores 100 records (this is configured through flush.size) in a single Avro file at the HDFS location. I want to know how to calculate performance metrics for it, such as the number of records or messages written to the HDFS location per second, bytes written per second, throughput, latency, etc. We do not have Kafka Control Center configured, and no JMX port is enabled through which it could be monitored, e.g. via JConsole.
Could this be calculated manually, and how?
Does any other tool exist that I could use, like JMeter? If yes, how?
I want to calculate application-specific metrics, not system-level ones.

Greenplum download dump to local cluster in parallel

Is there a more effective way to fetch a whole Greenplum dump than doing it through multiple JDBC connections to the master node?
I need to download the whole Greenplum dump through JDBC. To do the job more quickly, I am going to use Spark parallelism (fetching data in parallel through multiple JDBC connections). As I understand it, I will have multiple JDBC connections to Greenplum's single master node. I am going to store the data in HDFS in Parquet format.
For parallel exporting, you can try a gphdfs writable external table.
GPDB segments can write/read external sources in parallel.
http://gpdb.docs.pivotal.io/4340/admin_guide/load/topics/g-gphdfs.html
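A minimal sketch of that approach (the table, columns, and HDFS URL below are illustrative; see the gphdfs documentation above for the exact LOCATION and FORMAT options):

-- Writable external table backed by gphdfs; every segment writes its slice in parallel.
CREATE WRITABLE EXTERNAL TABLE sales_hdfs_ext (
    id  int,
    amt numeric
)
LOCATION ('gphdfs://namenode:8020/data/sales')
FORMAT 'TEXT' (DELIMITER '|');

-- Unload the regular table through the external table.
INSERT INTO sales_hdfs_ext SELECT * FROM sales;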
Now you can use the Greenplum-Spark connector to parallelize data transfer between Greenplum segments and Spark executors.
This connector speeds up the transfer because it leverages parallel processing in the Greenplum segments and the Spark workers. It is definitely faster than using a JDBC connector, which transfers data via the Greenplum master node.
Reference:
http://greenplum-spark.docs.pivotal.io/100/index.html

Hadoop Cassandra CqlInputFormat pagination

I am quite a newbie with Cassandra and have the following question:
I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows.
I run a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads data from that table, and I see that only 7k rows are read into the map phase.
I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data, and that a LIMIT clause is added (1k by default). So the 7k rows read can be explained:
7 nodes * 1k limit = 7k rows read total
The limit can be changed using CqlConfigHelper:
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
Please help me with the questions below:
Is this the desired behavior?
Why does CqlInputFormat not page through the rest of the rows?
Is it a bug, or should I just increase the InputCQLPageRowSize value?
What if I want to read all the data in the table and do not know the row count?
My problem was related to a bug in Cassandra 2.0.11 that added a strange LIMIT clause to the underlying CQL query used to read data into the map task.
I posted the issue to the Cassandra JIRA: https://issues.apache.org/jira/browse/CASSANDRA-9074
It turned out that the problem was strictly related to the following bug, fixed in Cassandra 2.0.12: https://issues.apache.org/jira/browse/CASSANDRA-8166

PIG and HIVE connectivity to Datastax Cassandra running huge no of maps

I am using DSE 3.2.4.
I have created three tables: one with 10M rows, one with 50k rows, and one with just 10 rows.
When I run a simple Pig or Hive query over these tables, it runs the same number of mappers for all of them.
In Pig, pig.splitCombination is true by default, in which case it runs only one map.
If I set this to false, it runs 513 maps.
In Hive, by default, it runs 513 maps.
I tried setting the following properties:
mapred.min.split.size=134217728 in `mapred-site.xml`: now running 513 maps for all tables
set pig.splitCombination=false in the Pig shell: now running only 1 map for all the tables
But no luck.
Finally, I found mapred.map.tasks = 513 in job.xml.
I tried to change this in mapred-site.xml, but it is not taking effect.
Please help me with this.
The number of mappers is managed by the split size, so don't configure it through Hadoop settings. Try passing &split_size= in your Pig URL, and set "cassandra.input.split.size" for Hive.
The default is 64M.
If your Cassandra uses vnodes, it creates many splits, so if your data is not big enough, turn off vnodes for the Hadoop nodes.
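For the Hive side that amounts to something like this (a sketch: the value and table name are illustrative, and the unit of cassandra.input.split.size varies by version, so check the DSE docs):

-- Set the split size for the session, then run the query.
SET cassandra.input.split.size=262144;
SELECT count(*) FROM my_cassandra_table;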
