I have created a ReplicatedMergeTree table as below:
CREATE TABLE probe.a on cluster dwh (
instime UInt64,
psn UInt64
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/probe/a', '{replica}')
PARTITION BY instime
ORDER BY (psn);
Then I created a distributed table as:
CREATE TABLE probe.a_distributed on cluster dwh (
instime UInt64,
psn UInt64
) ENGINE = Distributed(dwh, probe, a, rand());
I then added macros on each server:
Server 1
<yandex>
<macros replace="true">
<shard>1</shard>
<replica>server1.com</replica>
</macros>
</yandex>
Server 2
<yandex>
<macros replace="true">
<shard>2</shard>
<replica>server2.com</replica>
</macros>
</yandex>
Remote Servers:
<dwh>
<!-- shard 01 -->
<shard>
<replica>
<host>server1.com</host>
<port>9000</port>
<user>default</user>
<password>test12pwd</password>
</replica>
</shard>
<!-- shard 02 -->
<shard>
<replica>
<host>server2.com</host>
<port>9000</port>
<user>default</user>
<password>test12pwd</password>
</replica>
</shard>
</dwh>
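As a sanity check (a hedged sketch, not part of the original question), the macros and the cluster definition can be verified on each server through the built-in system tables:
-- run on each server: shows the {shard}/{replica} substitutions and the dwh cluster layout
SELECT * FROM system.macros;
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'dwh';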
I have two issues when dropping a partition:
When I drop a partition across the cluster:
ALTER TABLE probe.a ON CLUSTER dwh DROP PARTITION '2020-03-13';
I get the error:
DB::Exception: Table 'a' is replicated, but shard #4 isn't replicated according to its cluster definition. Possibly <internal_replication>true</internal_replication> is forgotten in the cluster config. (version 19.16.14.65)
When I drop the partition on each node individually, the distributed table still shows half of the rows, yet when I check locally there are no rows.
How can this issue with the distributed table be resolved when data is sharded without replication?
You use Replicated tables, so you MUST mark your shards with <internal_replication>true</internal_replication>:
<dwh>
<!-- shard 01 -->
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>server1.com</host>
<port>9000</port>
<user>default</user>
<password>test12pwd</password>
</replica>
</shard>
<!-- shard 02 -->
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>server2.com</host>
<port>9000</port>
<user>default</user>
<password>test12pwd</password>
</replica>
</shard>
</dwh>
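Once the updated cluster definition is picked up (remote_servers changes are normally reloaded from config without a restart, but restarting also works), the cluster-wide drop from the question should go through, e.g.:
-- same statement as in the question, now run against the fixed cluster config
ALTER TABLE probe.a ON CLUSTER dwh DROP PARTITION '2020-03-13';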
I am trying to join two tables as follows:
CREATE MATERIALIZED VIEW db.data_v ON CLUSTER shard1 TO db.table
AS
SELECT JSON_VALUE(db.table2_queue.message, '$.after.id') bid,
JSON_VALUE(message, '$.after.brand_id') AS brand_id,
JSON_VALUE(message, '$.after.id') AS id
FROM
db.table1_queue lq
Join db.table2_queue bq
on JSON_VALUE(bq.message, '$.after.id') = JSON_VALUE(lq.message, '$.after.brand_id')
However, I got an empty result:
0 rows in set. Elapsed: 0.006 sec.
I am trying to deploy ClickHouse on k8s to use as a Graphite backend. I am new to ClickHouse and have gone through links describing the same issue, but none is helping me. I am trying to create two ClickHouse servers, planning to add one more in the future. The ClickHouse servers are deployed as a k8s StatefulSet.
clickhouse1-0.clickhouse1-hs.ns-vaggarwal.svc.cluster.local :) select * from graphite
SELECT *
FROM graphite
Query id: 7ba316b8-bc88-4ab2-83d2-269a990f93b7
Received exception from server (version 20.12.4):
Code: 306. DB::Exception: Received from localhost:9000. DB::Exception: Stack size too large. Stack address: 0x7fd2d75fe000, frame address: 0x7fd2d79fd3f0, stack size: 4197392, maximum stack size: 8388608.
0 rows in set. Elapsed: 0.044 sec.
clickhouse1-0.clickhouse1-hs.ns-vaggarwal.svc.cluster.local :) exit
Example init.sql I am using:
CREATE TABLE IF NOT EXISTS default.graphite_index
(
Date Date,
Level UInt32,
Path String,
Version UInt32,
updated DateTime DEFAULT now(),
status Enum8('SIMPLE' = 0, 'BAN' = 1, 'APPROVED' = 2, 'HIDDEN' = 3, 'AUTO_HIDDEN' = 4)
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/single/default.graphite_index', '{replica}', updated)
PARTITION BY toYYYYMM(Date)
ORDER BY (Path)
SETTINGS index_granularity = 1024;
CREATE TABLE IF NOT EXISTS default.graphite (
Path String CODEC(ZSTD(2)),
Value Float64 CODEC(Delta, ZSTD(2)),
Time UInt32 CODEC(Delta, ZSTD(2)),
Date Date CODEC(Delta, ZSTD(2)),
Timestamp UInt32 CODEC(Delta, ZSTD(2))
) ENGINE = Distributed('graphite', '', graphite, xxHash64(Path));
CREATE DATABASE IF NOT EXISTS shard_01;
CREATE TABLE IF NOT EXISTS shard_01.graphite
AS default.graphite
ENGINE = ReplicatedGraphiteMergeTree('/clickhouse/tables/01/graphite', 'clickhouse1-0', 'graphite_rollup')
PARTITION BY toYYYYMM(Date)
ORDER BY (Path, Time);
CREATE DATABASE IF NOT EXISTS shard_02;
CREATE TABLE IF NOT EXISTS shard_02.graphite
AS default.graphite
ENGINE = ReplicatedGraphiteMergeTree('/clickhouse/tables/02/graphite', 'clickhouse1-0', 'graphite_rollup')
PARTITION BY toYYYYMM(Date)
ORDER BY (Path, Time);
Relevant part from config.xml:
<remote_servers>
<graphite>
<!-- Shard 01 -->
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse1-0</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse2-0</host>
<port>9000</port>
</replica>
</shard>
<!-- Shard 02 -->
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>clickhouse1-0</host>
<port>9000</port>
</replica>
<replica>
<host>clickhouse2-0</host>
<port>9000</port>
</replica>
</shard>
</graphite>
</remote_servers>
Metric paths have been indexed:
clickhouse1-0.clickhouse1-hs.ns-vaggarwal.svc.cluster.local :) show tables
SHOW TABLES
Query id: 8cc91a1f-0be1-4ae8-98ed-411b06624968
┌─name───────────┐
│ graphite │
│ graphite_index │
└────────────────┘
2 rows in set. Elapsed: 0.002 sec.
clickhouse1-0.clickhouse1-hs.ns-vaggarwal.svc.cluster.local :) select * from graphite_index LIMIT 5;
SELECT *
FROM graphite_index
LIMIT 5
Query id: ced975d1-cd06-49b7-a18f-96e0e62504fd
┌───────Date─┬─Level─┬─Path────────────────────────────────────────────────────────────┬────Version─┬─────────────updated─┬─status─┐
│ 1970-02-12 │ 30005 │ active.pickle.carbon-clickhouse1-6489b8f7c8-mbzpr.agents.carbon │ 1612267498 │ 2021-02-02 12:04:58 │ SIMPLE │
│ 1970-02-12 │ 30005 │ active.pickle.carbon-clickhouse1-6489b8f7c8-ts6kk.agents.carbon │ 1612267558 │ 2021-02-02 12:05:58 │ SIMPLE │
│ 1970-02-12 │ 30005 │ active.pickle.carbon-clickhouse2-7898cd697d-jndms.agents.carbon │ 1612271795 │ 2021-02-02 13:16:35 │ SIMPLE │
│ 1970-02-12 │ 30005 │ active.pickle.carbon-clickhouse2-7898cd697d-pbns7.agents.carbon │ 1612271786 │ 2021-02-02 13:16:26 │ SIMPLE │
│ 1970-02-12 │ 30005 │ active.tcp.carbon-clickhouse1-6489b8f7c8-mbzpr.agents.carbon │ 1612267498 │ 2021-02-02 12:04:58 │ SIMPLE │
└────────────┴───────┴─────────────────────────────────────────────────────────────────┴────────────┴─────────────────────┴────────┘
5 rows in set. Elapsed: 0.002 sec. Processed 1.12 thousand rows, 110.75 KB (613.49 thousand rows/s., 60.45 MB/s.)
You made an infinite loop:
CREATE TABLE IF NOT EXISTS default.graphite (
) ENGINE = Distributed('graphite', '', graphite, xxHash64(Path));
The Distributed table points to itself.
It must be Distributed('graphite', 'SOMEDATABASE', ...),
or you should set a default database per replica in remote_servers:
<replica>
<default_database>
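For example (a sketch only, keeping the hostnames and shard databases from the question purely to show where <default_database> goes; the warning below about circular replication still applies), shard 01 could be declared as:
<shard>
    <internal_replication>true</internal_replication>
    <replica>
        <host>clickhouse1-0</host>
        <port>9000</port>
        <default_database>shard_01</default_database>
    </replica>
    <replica>
        <host>clickhouse2-0</host>
        <port>9000</port>
        <default_database>shard_01</default_database>
    </replica>
</shard>
<!-- shard 02 would be declared the same way, with <default_database>shard_02</default_database> -->
With that in place, Distributed('graphite', '', graphite, xxHash64(Path)) should resolve to the shard databases instead of looping back to default.graphite.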
!!!!!! NEVER EVER use circular-replication
!!!!!! NEVER EVER use circular-replication
!!!!!! NEVER EVER use circular-replication
!!!!!! NEVER EVER use circular-replication
This article MUST be deleted. https://altinity.com/blog/2018/5/10/circular-replication-cluster-topology-in-clickhouse
This document made so much harm to ClickHouse.
I use ReplicatedMergeTree and a Distributed table in ClickHouse to make an HA cluster.
I think it should store two replicas in the cluster, so it will be OK when one of the nodes has problems.
This is some of my configuration (config.xml):
...
<logs>
<shard>
<weight>1</weight>
<internal_replication>true</internal_replication>
<replica>
<host>node1</host>
<port>9000</port>
</replica>
<replica>
<host>node2</host>
<port>9000</port>
</replica>
</shard>
<shard>
<weight>1</weight>
<internal_replication>true</internal_replication>
<replica>
<host>node2</host>
<port>9000</port>
</replica>
<replica>
<host>node3</host>
<port>9000</port>
</replica>
</shard>
<shard>
<weight>1</weight>
<internal_replication>true</internal_replication>
<replica>
<host>node3</host>
<port>9000</port>
</replica>
<replica>
<host>node1</host>
<port>9000</port>
</replica>
</shard>
</logs>
...
<!-- each node is different -->
<macros>
<layer>01</layer>
<shard>01</shard>
<replica>node1</replica>
</macros>
<!-- below is node2 and node3 configuration
<macros>
<layer>02</layer>
<shard>02</shard>
<replica>node2</replica>
</macros>
<macros>
<layer>03</layer>
<shard>03</shard>
<replica>node3</replica>
</macros>
-->
...
Then I create the table on each node via clickhouse-client --host:
create table if not exists game (uid Int32, kid Int32, level Int8, datetime Date)
ENGINE = ReplicatedMergeTree('/clickhouse/data/{shard}/game', '{replica}')
PARTITION BY toYYYYMMDD(datetime)
ORDER BY (uid, datetime);
After creating the ReplicatedMergeTree table, I then create the Distributed table on each node (just so that each node has this table; in fact it is only created on one node):
CREATE TABLE game_all AS game
ENGINE = Distributed(logs, default, game, rand());
This is just OK so far, and I also think it is OK when I insert data into game_all. But when I query the game table and the game_all table, I find something must be wrong.
I inserted one record into the game_all table, but the result is 3 rows when it must be one, and when I query each game table, only one of them has 1 record. Finally I checked each node's disk and this table seems to have no replicas, because only one node uses more than 4KB of disk; the others use just 4KB.
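A minimal sketch of that per-node check (assuming the default database used in the question) is to compare the local count with the count seen through the Distributed table on every node:
-- run on node1, node2 and node3 in turn
SELECT count() FROM default.game;      -- rows stored on this node only
SELECT count() FROM default.game_all;  -- rows seen through the Distributed table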
I can create and drop tables and run queries normally in Presto, but when I use INSERT, it always fails as shown below:
presto:default> create table test.lll (a int);
CREATE TABLE
presto:default> insert into test.lll select 1;
Query 20180104_091933_00007_k8e78, FAILED, 5 nodes
Splits: 84 total, 30 done (35.71%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20180104_091933_00007_k8e78 failed: No page sink provider for connector 'hive'
What is the reason and how to address it? Any help is appreciated.
Error Type: INTERNAL_ERROR
Error Code: GENERIC_INTERNAL_ERROR (65536)
Full stack trace:
java.lang.IllegalArgumentException: No page sink provider for connector 'hive'
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:191)
at com.facebook.presto.split.PageSinkManager.providerFor(PageSinkManager.java:67)
at com.facebook.presto.split.PageSinkManager.createPageSink(PageSinkManager.java:61)
at com.facebook.presto.operator.TableWriterOperator$TableWriterOperatorFactory.createPageSink(TableWriterOperator.java:97)
at com.facebook.presto.operator.TableWriterOperator$TableWriterOperatorFactory.createOperator(TableWriterOperator.java:88)
at com.facebook.presto.operator.DriverFactory.createDriver(DriverFactory.java:92)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunnerFactory.createDriver(SqlTaskExecution.java:515)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunnerFactory.access$1400(SqlTaskExecution.java:490)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:616)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at com.facebook.presto.execution.executor.LegacyPrioritizedSplitRunner.process(LegacyPrioritizedSplitRunner.java:23)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:492)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Executing a Hive query with a filter on the virtual column INPUT__FILE__NAME results in the following exception.
hive> select count(*) from netflow where INPUT__FILE__NAME='vzb.1351794600.0';
FAILED: SemanticException java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField#1d264bf5, org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField#3d44d0c6,
.
.
.
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField#7e6bc5aa]
This error is different from the one we get when the column name is wrong:
hive> select count(*) from netflow where INPUT__FILE__NAM='vzb.1351794600.0';
FAILED: SemanticException [Error 10004]: Line 1:35 Invalid table alias or column reference 'INPUT__FILE__NAM': (possible column names are: first, last, ....)
But using this virtual column in the SELECT clause works fine:
hive> select INPUT__FILE__NAME from netflow group by INPUT__FILE__NAME;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 4
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201306041359_0006, Tracking URL = http://192.168.0.224:50030/jobdetails.jsp?jobid=job_201306041359_0006
Kill Command = /opt/hadoop/bin/../bin/hadoop job -kill job_201306041359_0006
Hadoop job information for Stage-1: number of mappers: 12; number of reducers: 4
2013-06-14 18:20:10,265 Stage-1 map = 0%, reduce = 0%
2013-06-14 18:20:33,363 Stage-1 map = 8%, reduce = 0%
.
.
.
2013-06-14 18:21:15,554 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201306041359_0006
MapReduce Jobs Launched:
Job 0: Map: 12 Reduce: 4 HDFS Read: 3107826046 HDFS Write: 55 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
hdfs://192.168.0.224:9000/data/jk/vzb/vzb.1351794600.0
Time taken: 78.467 seconds
I am trying to create an external Hive table on already-present HDFS data, and I have extra files in the folder that I want to ignore, similar to what is asked and suggested in the following Stack Overflow questions:
how to make hive take only specific files as input from hdfs folder
when creating an external table in hive can I point the location to specific files in a directory?
Any help would be appreciated.
The full stack trace I am getting is as follows:
2013-06-14 15:01:32,608 ERROR ql.Driver (SessionState.java:printError(401)) - FAILED: SemanticException java.lang.RuntimeException: cannot find field input__
org.apache.hadoop.hive.ql.parse.SemanticException: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.object
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrOpProcFactory$FilterPCR.process(PcrOpProcFactory.java:122)
at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:89)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:87)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:124)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:101)
at org.apache.hadoop.hive.ql.optimizer.pcr.PartitionConditionRemover.transform(PartitionConditionRemover.java:86)
at org.apache.hadoop.hive.ql.optimizer.Optimizer.optimize(Optimizer.java:102)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:8163)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:50)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:258)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:335)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:893)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:755)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.ser
at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner.prune(PartitionPruner.java:231)
at org.apache.hadoop.hive.ql.optimizer.pcr.PcrOpProcFactory$FilterPCR.process(PcrOpProcFactory.java:112)
... 23 more
Caused by: java.lang.RuntimeException: cannot find field input__file__name from [org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyF
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:344)
at org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector.getStructFieldRef(UnionStructObjectInspector.java:100)
at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:57)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator.initialize(ExprNodeGenericFuncEvaluator.java:128)
at org.apache.hadoop.hive.ql.optimizer.ppr.PartExprEvalUtils.prepareExpr(PartExprEvalUtils.java:100)
at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner.pruneBySequentialScan(PartitionPruner.java:328)
at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionPruner.prune(PartitionPruner.java:219)
... 24 more