Question about ClickHouse ReplicatedMergeTree and Distributed engines

Why doesn't the ReplicatedMergeTree engine replicate any data?
I have deployed ClickHouse on three nodes.
I configured a cluster with 3 shards and 2 replicas, and created a table with the ReplicatedMergeTree engine on all three nodes. I then inserted one row on one of the nodes, but I can only query that data from the node where I did the insert. Why? Since I configured 2 replicas, I expected to be able to query the data from the other nodes as well.
Also, if I create a table with the Distributed engine on top of the ReplicatedMergeTree table and insert one row into it, I can query two rows on the node where I did the insert, but on the other nodes I sometimes get one row and sometimes get nothing.
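For reference, a minimal sketch of the kind of setup being described; the database, table, column and cluster names are my own placeholders, not taken from the question, and the ZooKeeper path and replica name shown are for one node of one shard and would have to be adjusted (or driven by macros) per shard and per node:

-- Created on every node; whether replication actually happens is determined
-- entirely by the two engine arguments: replicas of the same shard must use
-- the same ZooKeeper path and distinct replica names.
CREATE TABLE default.events_local
(
    id UInt64,
    value String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/01/events_local', 'node1')
ORDER BY id;

-- Distributed table over the whole cluster (cluster name is an assumption).
CREATE TABLE default.events_all AS default.events_local
ENGINE = Distributed(my_cluster, default, events_local, rand());

INSERT INTO default.events_all VALUES (1, 'hello');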

Related to: Distributed Engine Table in ClickHouse. It sounds like everything is working as intended: you get half of the data from one server and the full result (both halves) from the distributed table. You have put only half of the data on each shard, not the whole dataset.

You may have the same problem as I had here:
Clickhouse shows duplicates data in distributed table
Solution: don't use a circular cluster, or pay attention to how you create the tables.
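For what it's worth, a minimal sketch of the kind of table definition this answer points at, assuming {shard} and {replica} macros are defined per node in the server config (table and column names are placeholders): because each shard gets its own ZooKeeper path and each node its own replica name, replicas of different shards cannot end up in the same replication group, which is the usual cause of duplicated rows in a circular layout.

CREATE TABLE events_local
(
    id UInt64,
    value String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY id;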

Related

Is it possible to get data from both Hadoop clusters?

I have two Hadoop clusters and I am moving one big table from one cluster to the other.
But I do not have enough space to move the whole table and then open it to the users. So while I am moving the table to the other cluster, I need to serve the table's data from both Hadoop clusters.
I know I can use Presto for this, but the table is too big and I have a problem with the Presto cluster's RAM capacity.
So is there any way to do this?

How to do small queries efficiently in Clickhouse

In our deployment, there are one thousand shards. Insertions are done via a distributed table with the sharding key jumpConsistentHash(colX, 1000). When I query for rows with colX=... and turn on send_logs_level='trace', I see that the query is sent to all shards and executed on each shard. This is limiting our QPS (queries per second). Checking the ClickHouse documentation, it states:
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard, you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
You should be concerned about the sharding scheme in the following cases:
* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners).
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard.
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards.
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them.
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
It seems there is a solution for small queries like ours (the second bullet above), but I am not clear on the details. Does it mean that for a query with the predicate colX=..., I need to find the corresponding "layer" that contains its rows and then query the distributed table for that layer?
Is there a way to run these small queries against the global distributed table instead?
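For illustration only, a minimal sketch of the bi-level sharding layout the quoted documentation describes; the cluster names layer_01 and all_layers, the database default, and the table/column names are assumptions, and each cluster would have to be declared in remote_servers:

-- Local table, created on every node.
CREATE TABLE hits_local
(
    client_id UInt32,
    event_time DateTime,
    url String
)
ENGINE = MergeTree
ORDER BY (client_id, event_time);

-- One Distributed table per layer: small per-client queries go here.
CREATE TABLE hits_layer_01 AS hits_local
ENGINE = Distributed(layer_01, default, hits_local, rand());

-- One global Distributed table spanning all layers: heavy global queries go here.
CREATE TABLE hits_all AS hits_local
ENGINE = Distributed(all_layers, default, hits_local, rand());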

Strange replication in Cassandra

I have configured 3 local nodes in one 'Test Cluster' of Cassandra. When I run them and create a keyspace or a table, the keyspace or table appears on all three nodes.
The problem I'm dealing with is that when I import millions of rows from CSV into the table I built, all of the data suddenly appears on all three nodes. I have the same data replicated across the three nodes.
As far as I understand, the data I'm importing should be distributed across the nodes only partially: one partition on the first node, the second on the third, the third on the second node, the fourth again on the first node, and so on.
Am I right, or am I missing something big?
Also, my local write speed is about 10k rows/second for the multi-node cluster. Isn't that a bit too low?
I want to start a discussion so I can learn something from your experience and see where I'm messing things up.
Thank you!
The number of nodes that data is written to in your cluster is determined by the replication factor for that keyspace. If you have 3 nodes and the data is being written to all of them, then this setting must be set to 3. If you only want the data to be replicated to two nodes, you'd set this value to two.
Your write speed will be affected by the consistency level you specify on the write. If you have it set to ALL, then you have to wait until every node that will store the data has written it (in your case, all 3 nodes, based on your replication factor). Lowering your write consistency level will probably net you faster write times. There is a balance between your replication factor, write consistency level, and read consistency level that you can research further.
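As a rough cqlsh illustration of both points (the keyspace name and the replication factor of 2 are placeholders, not a recommendation):

-- With SimpleStrategy and replication_factor = 2, each row is stored on 2 of
-- the 3 nodes rather than on all of them.
CREATE KEYSPACE test_ks
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};

-- cqlsh session setting: how many replicas must acknowledge a write.
CONSISTENCY ONE;   -- wait for a single replica (faster writes)
-- CONSISTENCY ALL; -- wait for every replica (slowest, strongest)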

Realizing different distribution models in HDFS?

As far as I understand from the Hadoop tutorial, it takes the overall size of the input files, divides them into blocks/chunks, and then these blocks are replicated on different nodes. However, I want to realize a data distribution model according to the requirements given below:
(a) Case one: each file is partitioned across the nodes in the cluster equally
-- so that each map task gets this partition of the table to access. Is this possible?
(b) Case two: each file is fully replicated on two or more nodes, but not all nodes,
so that each map task accesses some part of the table on each node. Is this possible?
HDFS does not store tables, it stores files. Higher level projects offer 'relational tables', like Hive. Hive does allow you to partition a table stored on HDFS, see Hive Tutorial.
That being said, you should not tie partitioning to the number of nodes in the cluster. Nodes come and go, clusters grow and shrink. Partitioned relational tables partition/bucket by natural boundaries without any relation to cluster size. Import, export, and daily operations all play a role in partitioning (and usually a much bigger role than cluster size). Even a single table (file) can easily be spread across every node of the cluster.
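For example (table and column names are made up), a Hive table is typically partitioned by a natural boundary such as a date, regardless of how many datanodes the cluster has:

-- Partitions follow the data (one directory per dt value), not the cluster size.
CREATE TABLE page_views (
    user_id BIGINT,
    url     STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;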
If you want to tune an MR job for optimal split size/location, there are plenty of ways to do that. You still have a lot to read; you are optimizing too early.

Is a collocated join (à la Netezza) theoretically possible in Hive?

When you join tables that are distributed on the same key and use these key columns in the join condition, each SPU (machine) in Netezza works 100% independently of the others (see nz-interview).
In Hive, there is the bucketed map join, but the distribution of the files representing the tables to datanodes is the responsibility of HDFS; it is not done according to the Hive CLUSTERED BY key.
So suppose I have 2 tables, CLUSTERED BY the same key, and I join on that key: can Hive get a guarantee from HDFS that matching buckets will sit on the same node? Or will it always have to move the matching bucket of the small table to the datanode containing the big table's bucket?
Thanks, ido
(note: this is a better phrasing of my previous question: How does hive/hadoop assures that each mapper works on data that is local for it?)
I think it is not possible to tell HDFS where to store blocks of data.
I can suggest the following trick, which will do for small clusters: increase the replication factor for one of the tables to a number close or equal to the number of nodes in the cluster.
As a result, during the join the appropriate data will almost always (or always) be present on the required node.
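A rough HiveQL sketch of the bucketed-map-join setup from the question and the replication-factor trick from this answer (table names, bucket count, and replication value are assumptions, not tested advice):

-- Bucket both tables on the join key so Hive can use a bucket map join;
-- note that HDFS still decides where the bucket files physically live.
CREATE TABLE big_t   (k BIGINT, v STRING) CLUSTERED BY (k) INTO 64 BUCKETS STORED AS ORC;
CREATE TABLE small_t (k BIGINT, v STRING) CLUSTERED BY (k) INTO 64 BUCKETS STORED AS ORC;

SET hive.optimize.bucketmapjoin = true;

-- The trick from the answer: before loading the small table, raise the HDFS
-- replication factor for the files it writes, so its blocks end up on
-- (almost) every node.
SET dfs.replication = 3;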
