ClickHouse shows duplicate data in a distributed table

I have 3 nodes, with 3 shards and 2 replicas of each shard spread across them:
ClickHouse cluster settings (screenshot)
I have also added the XML config for the shards and replicas:
<default_cluster>
    <shard>
        <internal_replication>true</internal_replication>
        <replica>
            <default_database>shard</default_database>
            <host>clickhouse-0</host>
            <port>9000</port>
            <user>default</user>
            <password>default</password>
        </replica>
        <replica>
            <default_database>replica</default_database>
            <host>clickhouse-2</host>
            <port>9000</port>
            <user>default</user>
            <password>default</password>
        </replica>
    </shard>
    <shard>
        <internal_replication>true</internal_replication>
        <replica>
            <default_database>shard</default_database>
            <host>clickhouse-1</host>
            <port>9000</port>
            <user>default</user>
            <password>default</password>
        </replica>
        <replica>
            <default_database>replica</default_database>
            <host>clickhouse-0</host>
            <port>9000</port>
            <user>default</user>
            <password>default</password>
        </replica>
    </shard>
    <shard>
        <internal_replication>true</internal_replication>
        <replica>
            <default_database>shard</default_database>
            <host>clickhouse-2</host>
            <port>9000</port>
            <user>default</user>
            <password>default</password>
        </replica>
        <replica>
            <default_database>replica</default_database>
            <host>clickhouse-1</host>
            <port>9000</port>
            <user>default</user>
            <password>default</password>
        </replica>
    </shard>
</default_cluster>
I am running the following example:
create database test on cluster default_cluster;
CREATE TABLE test.test_distributed_order_local on cluster default_cluster
(
id integer,
test_column String
)
ENGINE = ReplicatedMergeTree('/default_cluster/test/tables/test_distributed_order_local/{shard}', '{replica}')
PRIMARY KEY id
ORDER BY id;
CREATE TABLE test.test_distributed_order on cluster default_cluster as test.test_distributed_order_local
ENGINE = Distributed(default_cluster, test, test_distributed_order_local, id);
insert into test.test_distributed_order values (1, 'test1');
insert into test.test_distributed_order values (2, 'test2');
insert into test.test_distributed_order values (3, 'test3');
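Then I select through the distributed table, for example (a representative query; the actual selects behind the screenshots may differ):
-- read everything back through the Distributed table
SELECT id, test_column
FROM test.test_distributed_order
ORDER BY id;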
The results are not the same across runs, and they contain duplicates, e.g.:
Result 1 (screenshot)
Result 2 (screenshot)
What am I missing?
I expect the select not to return duplicated rows.

I think this post probably sums up what you're trying to achieve - https://altinity.com/blog/2018/5/10/circular-replication-cluster-topology-in-clickhouse
It's a little old, but the principle still applies: this is not a topology that is recommended for ClickHouse.
Consider this simplified example:
<shard>
    <!-- These two are replicas of each other -->
    <replica>
        <host>cluster_node_0</host>
    </replica>
    **<replica>
        <host>cluster_node_2</host>
    </replica>**
</shard>
<shard>
    <replica>
        <host>cluster_node_1</host>
    </replica>
    <replica>
        <host>cluster_node_0</host>
    </replica>
</shard>
<shard>
    **<replica>
        <host>cluster_node_2</host>
    </replica>**
    <replica>
        <host>cluster_node_1</host>
    </replica>
</shard>
Let's suppose data is written into the first shard on node cluster_node_0. It will then be replicated to the replica on cluster_node_2, as the ZooKeeper path is the same.
Now for the issue: you have also defined the 3rd shard on cluster_node_2. When you create this table, the local table on that node will physically contain data from 2 shards, the 1st and the 3rd, which I've attempted to highlight with **.
When a query comes in, it will be sent to each shard. The problem is that each local table will respond with results from both shards it holds, hence you get duplicates.
Generally, avoid placing more than one shard on a host. The blog explains how it can be done, but it's neither recommended nor usually needed.
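One quick way to confirm this kind of overlap is to check whether any host appears in more than one shard of the cluster definition; a sketch using the system.clusters table (column names as in recent ClickHouse versions):
-- hosts that participate in more than one shard point at the circular topology above
SELECT host_name, groupUniqArray(shard_num) AS shard_nums
FROM system.clusters
WHERE cluster = 'default_cluster'
GROUP BY host_name
HAVING length(shard_nums) > 1;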

ClickHouse shows duplicates because you use the same hosts in multiple shards.
When you execute a SELECT, the query is rewritten and executed on one replica in each shard.
Because the same replica is present in different shards, the same data is read twice.
Usually a shard implies that its data does not intersect with the data of the other shards.
If you want a cluster with 3 shards and 2 replicas in each shard, you need 6 different replicas: clickhouse-0..5.
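With such a layout, system.clusters should list six distinct hosts, one per (shard, replica) pair; a quick sanity check (a sketch, assuming the cluster keeps the name default_cluster):
-- expect 3 shards x 2 replicas = 6 rows, each with a different host_name
SELECT shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'default_cluster'
ORDER BY shard_num, replica_num;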

Related

Create table on cluster of clickhouse error

When I create a table as follows:
CREATE TABLE partition_v3_cluster ON CLUSTER perftest_3shards_3replicas(
ID String,
URL String,
EventTime Date
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventTime)
ORDER BY ID;
I get errors:
Query id: fe98c8b6-16af-44a1-b8c9-2bf10d9866ea
┌─host────────┬─port─┬─status─┬─error──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 10.18.1.131 │ 9000 │ 371 │ Code: 371, e.displayText() = DB::Exception: There are two exactly the same ClickHouse instances 10.18.1.131:9000 in cluster perftest_3shards_3replicas (version 21.6.3.14 (official build)) │ 2 │ 0 │
│ 10.18.1.133 │ 9000 │ 371 │ Code: 371, e.displayText() = DB::Exception: There are two exactly the same ClickHouse instances 10.18.1.133:9000 in cluster perftest_3shards_3replicas (version 21.6.3.14 (official build)) │ 1 │ 0 │
│ 10.18.1.132 │ 9000 │ 371 │ Code: 371, e.displayText() = DB::Exception: There are two exactly the same ClickHouse instances 10.18.1.132:9000 in cluster perftest_3shards_3replicas (version 21.6.3.14 (official build)) │ 0 │ 0 │
└─────────────┴──────┴────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┴──────────────────┘
← Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.) 0%
3 rows in set. Elapsed: 0.149 sec.
Received exception from server (version 21.6.3):
Code: 371. DB::Exception: Received from localhost:9000. DB::Exception: There was an error on [10.18.1.131:9000]: Code: 371, e.displayText() = DB::Exception: There are two exactly the same ClickHouse instances 10.18.1.131:9000 in cluster perftest_3shards_3replicas (version 21.6.3.14 (official build)).
And here is my metrika.xml:
<?xml version="1.0" encoding="utf-8"?>
<yandex>
    <remote_servers>
        <perftest_3shards_3replicas>
            <shard>
                <replica>
                    <host>10.18.1.131</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>10.18.1.132</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>10.18.1.133</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>10.18.1.131</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>10.18.1.132</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>10.18.1.133</host>
                    <port>9000</port>
                </replica>
            </shard>
            <shard>
                <replica>
                    <host>10.18.1.131</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>10.18.1.132</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>10.18.1.133</host>
                    <port>9000</port>
                </replica>
            </shard>
        </perftest_3shards_3replicas>
    </remote_servers>
    <zookeeper>
        <node>
            <host>10.18.1.131</host>
            <port>2181</port>
        </node>
        <node>
            <host>10.18.1.132</host>
            <port>2181</port>
        </node>
        <node>
            <host>10.18.1.133</host>
            <port>2181</port>
        </node>
    </zookeeper>
    <macros>
        <shard>01</shard>
        <replica>01</replica>
    </macros>
</yandex>
I don't know where it's wrong. Can someone help me? Thank you.

Sentiment analysis of Twitter data using Hadoop and Pig

Tweets from Twitter are stored in HDFS in Hadoop.
The tweets need to be processed for sentiment analysis. The tweets in HDFS are in Avro format, so they need to be loaded with a JSON loader, but in the Pig script the tweets from HDFS are not getting read. After changing the JAR files, the Pig script fails.
With the following JAR files the Pig script fails:
REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/Desktop/json-simple-3.1.0.jar';
With this other set of JAR files it does not fail, but no data is read either:
REGISTER '/home/cloudera/Desktop/elephant-bird-hadoop-compat-4.17.jar';
REGISTER '/home/cloudera/Desktop/elephant-bird-pig-4.17.jar';
REGISTER '/home/cloudera/Desktop/json-simple-1.1.jar';
Here are all the Pig commands I have used:
tweets = LOAD '/user/cloudera/OutputData/tweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH tweets GENERATE myMap#'id' as id ,myMap#'tweets' as tweets;
tokens = foreach B generate id, tweets, FLATTEN(TOKENIZE(tweets)) As word;
dictionary = load ' /user/cloudera/OutputData/AFINN.txt' using PigStorage('\t') AS(word:chararray,rating:int);
word_rating = join tokens by word left outer, dictionary by word using 'replicated';
describe word_rating;
rating = foreach word_rating generate tokens::id as id,tokens::tweets as tweets, dictionary::rating as rate;
word_group = group rating by (id,tweets);
avg_rate = foreach word_group generate group, AVG(rating.rate) as tweet_rating;
positive_tweets = filter avg_rate by tweet_rating>=0;
DUMP positive_tweets;
negative_tweets = filter avg_rate by tweet_rating<=0;
DUMP negative_tweets;
Error when dumping the tweets with the first set of JAR files:
Input(s):
Failed to read data from "/user/cloudera/OutputData/tweets"
Output(s):
Failed to produce result in "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp37889715"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0001
2019-05-03 09:59:09,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2019-05-03 09:59:09,427 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
Details at logfile: /home/cloudera/pig_1556902594207.log
Output when dumping the tweets with the second set of JAR files:
Input(s):
Successfully read 0 records (5178477 bytes) from: "/user/cloudera/OutputData/tweets"
Output(s):
Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-1614543351/tmp479037703"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1556902124324_0002
2019-05-03 10:01:05,417 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2019-05-03 10:01:05,418 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2019-05-03 10:01:05,418 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2019-05-03 10:01:05,428 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2019-05-03 10:01:05,428 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
The expected output was the positive and negative tweets sorted out, but I am getting errors instead.
Please do help. Thank you.
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias tweets. Backend error : org.json.simple.parser.ParseException
This usually indicates a syntax error in the Pig script.
The AS keyword in a LOAD statement usually requires a schema. myMap in your LOAD statement is not a valid schema.
See https://stackoverflow.com/a/12829494/8886552 for an example of JsonLoader.

Why does the ES cluster stop working until I delete the old index?

The ES documentation says that if we restart Node 1 and it still has copies of the old shards, it will try to reuse them, copying over from the primary shard only the files that have changed in the meantime.
So I did an experiment.
There are 5 nodes in my cluster. Primary shard 1 is stored on node 1 and replica shard 1 is stored on node 2. When I restart node 1 and node 2, primary shard 1's state becomes UNASSIGNED and replica shard 1's state becomes UNASSIGNED too; the health of the cluster becomes red and never goes back to green. The cluster stops working until I delete the old index.
Here is part of the master log:
[ERROR][marvel.agent ] [es10] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
at java.lang.Thread.run(Thread.java:745)
Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[8]: index [.marvel-es-data], type [cluster_info], id [nm4dj3ucSRGsdautV_GDDw], message [UnavailableShardsException[[.marvel-es-data][1] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-data][1]}]]]];
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
... 3 more
Caused by: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[8]: index [.marvel-es-data], type [cluster_info], id [nm4dj3ucSRGsdautV_GDDw], message [UnavailableShardsException[[.marvel-es-data][1] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-data][1]}]]]]
at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
... 3 more
[2016-02-19 12:53:18,769][ERROR][marvel.agent ] [es10] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
at java.lang.Thread.run(Thread.java:745)
Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[8]: index [.marvel-es-data], type [cluster_info], id [nm4dj3ucSRGsdautV_GDDw], message [UnavailableShardsException[[.marvel-es-data][1] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-data][1]}]]]];
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
... 3 more
Caused by: ElasticsearchException[failure in bulk execution, only the first 100 failures are printed:
[8]: index [.marvel-es-data], type [cluster_info], id [nm4dj3ucSRGsdautV_GDDw], message [UnavailableShardsException[[.marvel-es-data][1] primary shard is not active Timeout: [1m], request: [shard bulk {[.marvel-es-data][1]}]]]]
at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:114)
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
... 3 more

Use Pig to count the number of records in an Avro file

I can open an Avro file in Hue, and Hue shows me it has 10 records. I can browse through all 10 records in Hue.
Now I write the following code in Pig:
data = LOAD '/user/admin/2015/10/04/02/file1.avro' USING AvroStorage();
data_group = GROUP data ALL;
row_count = FOREACH data_group GENERATE COUNT(data);
dump row_count;
The output of the job is
Input(s):
Successfully read 4 records (58507 bytes) from: "/user/admin/2015/10/04/02/file1.avro"
Output(s):
Successfully stored 1 records (6 bytes) in: "hdfs://nn1/tmp/temp-268177355/tmp915757783"
Counters:
Total records written : 1
Total bytes written : 6
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1438959478020_940907
2015-10-29 19:08:55,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2015-10-29 19:08:55,252 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-29 19:08:55,253 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-10-29 19:08:55,261 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-10-29 19:08:55,261 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(4)
How did 10 become 4? Is there a different way to count the number of records in an Avro file using Pig?

HBase Hive Integration - Error

When I try to load data from HDFS into HBase using Hive logical tables, I am facing the following problem. I am new to Hadoop and not able to trace the error. I am using the CDH4 VM.
Creating a new HBase table which is managed by Hive:
CREATE TABLE hive_hbasetable(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "hivehbasek1");
HBase shell output:
hbase(main):002:0> list
TABLE
hivebasek1
mysql_cityclimate
2 row(s) in 0.2470 seconds
I created a logical table hive_logictable in Hive
CREATE TABLE hive_logictable (foo INT, bar STRING) row format delimited fields terminated by ',';
Inserting data into hive_logictable from a local text file:
cat TextFile.txt
100,value1
101,value2
102,value3
103,value4
104,value5
105,value6
LOAD DATA LOCAL INPATH '/home/cloudera/TextFile.txt' OVERWRITE INTO TABLE hive_logictable;
Loading data into the HBase table using Hive:
INSERT OVERWRITE TABLE hive_hbasetable SELECT * FROM hive_logictable;
Below are the error messages thrown:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201501200937_0004, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201501200937_0004
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201501200937_0004
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2015-01-20 10:38:07,412 Stage-0 map = 0%, reduce = 0%
2015-01-20 10:38:52,822 Stage-0 map = 100%, reduce = 100%
Ended Job = job_201501200937_0004 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201501200937_0004
Examining task ID: task_201501200937_0004_m_000002 (and more) from job job_201501200937_0004
Task with the most failures(4):
-----
Task ID:
task_201501200937_0004_m_000000
URL:
http://localhost.localdomain:50030/taskdetails.jsp?jobid=job_201501200937_0004&tipid=task_201501200937_0004_m_000000
-----
Diagnostic Messages for this Task:
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.ja
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
End of Error Message.
Could you please check if an atomic insert works fine on the Hive table, and share the results?
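For example, a minimal single-row write through the HBase storage handler would isolate the failure (a sketch; Hive on CDH4 predates INSERT ... VALUES, so the row comes from the existing staging table):
-- write a single row into the HBase-backed table to test the Hive/HBase path
INSERT OVERWRITE TABLE hive_hbasetable
SELECT foo, bar FROM hive_logictable LIMIT 1;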
