Clickhouse - "Too many partitions for single INSERT block" - insert

During a reload of a replicated MySQL database into ClickHouse using clickhouse-mysql, I run into the "Too many partitions for single INSERT block" error and seem unable to make progress.
So far, some of the things I've tried:
- setting max_partitions_per_insert_block to ZERO, to see if the load would get through
- setting --mempool-max-rows to 5,000, 10,000, and 20,000, so that it loads in smaller batches instead of the 100,000-row default
- using PARTITION BY toYYYYMMDD(created) when creating the table
The ClickHouse table is created as shown below, which is pretty similar to what the automatic creation (--create-table) produced, except that it missed a few of the Nullable types:
CREATE TABLE DB.DB__main (
`id` Int64,
`user_id` Int64,
`screenname` String,
`created` DateTime,
`tweet_id` Int64,
`tweet` String,
`retweet_count` Nullable(Int32),
`mediastatus` Nullable(String),
`country` Nullable(String),
`countrycode` Nullable(String),
`city` Nullable(String),
`latitude0` Nullable(String),
`longitude0` Nullable(String),
`latitude1` Nullable(String),
`longitude1` Nullable(String),
`latitude2` Nullable(String),
`longitude2` Nullable(String),
`latitude3` Nullable(String),
`longitude3` Nullable(String),
`datetime` DateTime,
`datetime_update` Nullable(DateTime),
`status` Nullable(String),
`div0` Nullable(String),
`div1` Nullable(String),
`div2` Nullable(Int64),
`datasource` Nullable(String)
) ENGINE = ReplacingMergeTree() PARTITION BY toYYYYMM(created) ORDER BY (id, user_id, screenname, created, tweet_id, datetime)
Also, why does the schema get repeated as DB.DB__tablename? I ran into this odd situation when I first started using ClickHouse and clickhouse-mysql --create-table. It stopped when it was about to start migrating the content, and it took a while before I realized the table names had been changed from "schema"."table-name" to "schema"."schema__table-name". After renaming the tables, --migrate-table could run.

max_partitions_per_insert_block -- Limit maximum number of partitions in single INSERTed block. Zero means unlimited. Throw exception if the block contains too many partitions. This setting is a safety threshold, because using large number of partitions is a common misconception.
By default max_partitions_per_insert_block = 100
So with PARTITION BY toYYYYMMDD(created), your insert will fail if it covers more than 100 different days.
With PARTITION BY toYYYYMM(created), your insert will fail if it covers more than 100 different months.
Nullable -- uses up to twice the disk space and can be up to twice as slow as a non-Nullable column.
Schema repeated as DB.DB__tablename -- ask Altinity, the creators of clickhouse-mysql; it looks like a bug.
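For completeness, a minimal sketch of raising the threshold for an interactive session is below (the value 1000 is just an example; 0 disables the check). For clickhouse-mysql, which opens its own connections, the setting generally has to go into the user's profile in users.xml instead, so the tool's sessions pick it up.
SET max_partitions_per_insert_block = 1000; -- or 0 to disable the safety check
-- verify the effective value for the current session
SELECT name, value
FROM system.settings
WHERE name = 'max_partitions_per_insert_block';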

Related

Why does my table take more space even though it has fewer rows than another table?

I have the following table
Create table my_source(
ID number(15) not null,
Col_1 Varchar2(3000),
Col_2 Varchar2(3000),
Col_3 Varchar2(3000),
Col_4 Varchar2(3000),
Col_5 Varchar2(3000),
...
Col_90 Varchar2(3000)
);
This table has 6,926,220 rows.
Now I am going to create two tables based on this table.
Target1:
Create table el_temp as
select
id,
Col_1,
Col_2,
Col_3,
Col_4,
Col_5,
Col_6,
Col_7,
Col_8,
Col_9,
Col_10,
Col_11,
Col_12
from
my_source;
Target2:
Create table el_temp2 as
select DISTINCT
id,
Col_1,
Col_2,
Col_3,
Col_4,
Col_5,
Col_6,
Col_7,
Col_8,
Col_9,
Col_10,
Col_11,
Col_12
from
my_source;
select count(*) from el_temp; -- 6926220
select count(*) from el_temp2; --6880832
The only difference between el_temp and el_temp2 is the "distinct" operator.
Now I got the following result from SQL Developer
It is a surprising result to me that EL_TEMP, the one with more rows, has a smaller size, while EL_TEMP2 has fewer rows but a bigger size.
Could anyone share the reason and how to avoid this?
Thanks in advance!
The most likely cause is that the table has undergone some updates to existing rows over its lifetime.
By default, when you create a table, we reserve 10% of the space in each block for rows to grow (due to updates). As updates occur, that space is used up, so your blocks might be (on average) around 95% full.
When you do "create table as select" from that table to another, we will take those blocks and pad them out again to 10% free space, thus making it slightly larger.
If PCTFREE etc is unfamiliar to you, I've also got a tutorial video to get you started here
https://youtu.be/aOZMp5mncqA
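To illustrate the PCTFREE point, here is a hedged sketch: rebuilding the copy with PCTFREE 0 packs the blocks completely, which is reasonable only if the rows will not be updated later (the table name el_temp2_packed is made up for this example).
CREATE TABLE el_temp2_packed PCTFREE 0 AS
SELECT DISTINCT id, Col_1, Col_2, Col_3, Col_4, Col_5,
       Col_6, Col_7, Col_8, Col_9, Col_10, Col_11, Col_12
FROM   my_source;
-- compare segment sizes afterwards
SELECT segment_name, ROUND(bytes / 1024 / 1024) AS mb
FROM   user_segments
WHERE  segment_name IN ('EL_TEMP2', 'EL_TEMP2_PACKED');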

How to decide the partition key for clickhouse

I want to know what's the best practice for the partition key.
In my project, we have a table with event_date, app_id and other columns. The number of distinct app_id values will keep growing and could reach thousands.
The select query is based on event_date and app_id.
The simple data schema is as below:
CREATE TABLE test.test_custom_partition (
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
) ENGINE = MergeTree()
PARTITION BY (toYYYYMMDD(event_date), app_id)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;
the select query is like below:
select event_name from test_custom_partition
where event_date >= '2020-07-01 00:00:00' AND event_date <= '2020-07-15 00:00:00'
AND app_id = 'test';
I want to use (toYYYYMMDD(event_date), app_id) as the partition key, as the query could then read the minimal set of data parts. But it could result in more than 1,000 partitions; from the documentation I see:
A merge only works for data parts that have the same value for the
partitioning expression. This means you shouldn't make overly granular
partitions (more than about a thousand partitions). Otherwise, the
SELECT query performs poorly because of an unreasonably large number
of files in the file system and open file descriptors.
Or should I use only toYYYYMMDD(event_date) as the partition key?
Also, could anyone explain why there shouldn't be more than 1,000 partitions? Even if the query only uses a small set of the data parts, could it still cause performance issues?
Thanks
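For reference, the commonly recommended layout is a coarser partition (for example monthly) combined with an ORDER BY that starts with app_id, so the primary key rather than the partition key prunes by app_id. A hedged sketch is below; the table name and the reordering of event_date ahead of company_id are assumptions, not taken from the question.
CREATE TABLE test.test_monthly_partition (
    company_id UInt64,
    app_id String,
    event_date DateTime,
    event_name String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)          -- coarse partitions, well under 1,000
ORDER BY (app_id, event_date, company_id)  -- primary key handles app_id + date pruning
SETTINGS index_granularity = 8192;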

ORA-01722 during CSV data import

I know there are many threads on this, but I am completely stumped (yes, I am a beginner).
Table Definition:
CREATE TABLE BUDGET (
    CHANNEL VARCHAR2(26),
    STORE NUMBER(5),
    REGION VARCHAR2(26),
    MONTH_454_SKEY NUMBER(8),
    SALES_AMOUNT NUMBER(9, 2),
    SALES_COUNT NUMBER(5),
    RETURN_AMOUNT NUMBER(10, 2),
    RETURN_COUNT NUMBER(5),
    TOTAL_ISSUANCE NUMBER(10, 2),
    TOTAL_ISSUANCE_COUNT NUMBER(6),
    FY_WEEK NUMBER(3),
    FY NUMBER(6)
)
My table has over 36,000 rows - however I am only receiving this error for random rows. Examples of the error rows:
INSERT INTO BUDGET (CHANNEL, STORE, REGION, MONTH_454_SKEY, SALES_AMOUNT, SALES_COUNT, RETURN_AMOUNT, RETURN_COUNT, TOTAL_ISSUANCE, TOTAL_ISSUANCE_COUNT, FY_WEEK, FY) VALUES ('Online',735.0,'SO',201601.0,4310.66,53.0,6108.24,89.0,10418.9,142.0,1.0,2016.0);
INSERT INTO BUDGET (CHANNEL, STORE, REGION, MONTH_454_SKEY, SALES_AMOUNT, SALES_COUNT, RETURN_AMOUNT, RETURN_COUNT, TOTAL_ISSUANCE, TOTAL_ISSUANCE_COUNT, FY_WEEK, FY) VALUES ('Online',738.0,'SO',201601.0,1237.86,21.0,5406.69,53.0,7472.55,74.0,1.0,2016.0);
I understand the meaning of the error, but don't understand why I am getting it. I only have 2 VARCHAR2 fields, 'Channel' and 'Region'. Any help would be greatly appreciated. TIA.
The actual error was occurring on different rows than what was being rejected by Oracle.
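One hedged way to hunt down such rows (Oracle 12.2 or later) is to load the CSV into an all-VARCHAR2 staging table first and then check which values fail numeric conversion; budget_stage and its columns below are hypothetical.
-- VALIDATE_CONVERSION returns 0 for values that would raise ORA-01722
SELECT *
FROM   budget_stage
WHERE  VALIDATE_CONVERSION(store AS NUMBER) = 0
   OR  VALIDATE_CONVERSION(sales_amount AS NUMBER) = 0
   OR  VALIDATE_CONVERSION(fy AS NUMBER) = 0;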

Migrating Oracle geometry rows from geodetic to cartesian

I have a table (granule) with about 4 million unique geometry objects that currently have SRID = 8307.
I am trying to create a SECOND table, with the same data, but using a cartesian coordinate system.
I created the table,
create table granule_cartesian (
granule varchar(64) not null,
SHAPE sdo_geometry NOT NULL );
and insert the proper geom_metadata
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values ( 'GRANULE_CARTESIAN', 'SHAPE',
mdsys.sdo_dim_array(
mdsys.sdo_dim_element('longitude', -180, 180, .5),
mdsys.sdo_dim_element('latitude', -90, 90, .5)),
null);
And now I want to copy the geometry contents of granule into granule_cartesian.
Obviously, the straight copy won't work because of SRID mismatch.
I can copy a few at a time by converting to wkt and back to geometry, stripping SRID:
insert into granule_cartesian
select granule,
SDO_GEOMETRY(SDO_UTIL.TO_WKTGEOMETRY(shape), null) as shape
from granule
where platform = 'ZZ'; -- granule has a few other columns...
This works if I select a subset of the granule table smaller than roughly 10K rows (which takes about +/-10 minutes). Any more than 10K and the query runs for hours, sometimes ungracefully disconnecting me.
It seems like there should be a way to do this WITHOUT doing <10K chunks. Besides taking FOREVER to actually migrate, this would pose a serious logistical nightmare on our active and dynamic production DB. I've tried using SDO_CS.TRANSFORM like this:
SDO_CS.TRANSFORM(geom => shape, to_srid => null )
... But Oracle will not accept a NULL SRID here:
12:57:49 [SELECT - 0 row(s), 0.000 secs] [Error Code: 1405, SQL State: 22002] ORA-01405: fetched column value is NULL
ORA-06512: at "MDSYS.SDO_CS", line 114
ORA-06512: at "MDSYS.SDO_CS", line 152
ORA-06512: at "MDSYS.SDO_CS", line 5588
ORA-06512: at "MDSYS.SDO_CS", line 3064
SDO_CS.TRANSFORM_LAYER will also refuse to accept a NULL SRID.
After extensive searching, I cannot find any method to do a streamlined geodetic -> cartesian (SRID=NULL) conversion. Does anyone have any ideas besides brute-force small batching?
EDITS
1) For clarity: I understand that I could probably break it up using PL/SQL and do 450 blocks of 10K rows. But at ~470 seconds per block, that is still 2.5 DAYS of execution, and that is a BEST-case scenario. Changing projections/coordinate systems using update granule set shape.srid = 8307 is FAST and EASY. Changing the coordinate system from cartesian to geodetic using insert into granule select SDO_CS.TRANSFORM(geom => shape, to_srid => 8307) ... is FAST and EASY. What I'm looking for is an equally simple/fast solution to go from geodetic to cartesian.
2) Tried to insert 300K as a test. It ran for approximately 10 hours and died like this:
20:06:59 [INSERT - 0 row(s), 0.000 secs] [Error Code: 4030, SQL State: 61000] ORA-04030: out of process memory when trying to allocate 8080 bytes (joxcx callheap,f:CDUnscanned)
ORA-04030: out of process memory when trying to allocate 8080 bytes (joxcx callheap,f:CDUnscanned)
ORA-04030: out of process memory when trying to allocate 16328 bytes (koh-kghu sessi,kgmtlbdl)
ORA-06512: at "MDSYS.SDO_UTIL", line 2484
ORA-06512: at "MDSYS.SDO_UTIL", line 2511
This is a beefy enterprise-level server running nothing but Oracle. We recently had an Oracle consultant (from Oracle) analyze all our DB systems (including this one). It was given a clean bill of health.
Something is wrong with the database. I have geom tables with 64 million rows (every mapped road in North America - yes, Canada, US, and Mexico) in them, and I routinely perform sdo_anyinteract / sdo_contains queries and get 200-square-mile responses in less than 5 seconds.
To do this, first of all drop any and all indexes and turn off logging on the target table or tablespace. If you don't have the permissions, ask your DBA, but the command is:
alter table [table] nologging; or alter tablespace [tablespace] nologging;
That should keep you from running out of redo space, although if you are running out of redo space your DBA should fix that by adding redo segments.
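If redo really is the limiting factor, adding online redo log groups is routine DBA work; a hedged one-liner is below (omitting the file specification assumes Oracle Managed Files is configured, otherwise supply an explicit path and size).
ALTER DATABASE ADD LOGFILE SIZE 512M;  -- adds another redo log group via OMF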
Use a cursor, because when the geometry comes back in from WKT the SRID has to be set on the resulting SDO object.
declare
  newGeom sdo_geometry;
begin
  -- table and column names taken from the question (granule / granule_cartesian)
  for rec in ( select granule, shape from granule where platform = 'ZZ' ) loop
    -- TO_WKTGEOMETRY returns WKT text; convert it back into an SDO object
    newGeom := sdo_geometry(sdo_util.to_wktgeometry(rec.shape), null);
    newGeom.sdo_srid := null;  -- the SRID that matches the target (cartesian: NULL)
    insert into granule_cartesian (granule, shape) values (rec.granule, newGeom);
  end loop;
  commit;
end;
/
With 4 million rows this should happen in just a few minutes; if not, your DB is seriously out of whack.
MAKE SURE YOU WORK WITH YOUR DBA.
When the process finishes, rebuild the domain indexes. That might take a couple of hours. Last time I did it on 64 million rows it took 3 days. You have to understand that R-trees are essentially indexes within indexes and use minimum bounding rectangles to get their speed, and they take a long time to build since each insert represents a traversal from the root of the index.
You can use things like BULK COLLECT, but that is too complex for this place. I suggest that, if you don't already have one, you get an Oracle account (they are free) and ask questions like this in the Oracle Forums under Database -> Spatial.
BrianB,
sorry, I just can't understand what you are trying to do with the SDO_GEOMETRY(SDO_UTIL.TO_WKTGEOMETRY(shape), null) conversion. If I understand it correctly, the resulting geometry will have the same geometry type, points, segments, and ordinates as the source shape.
So, if this is true, you can use one of these:
create table granule_cartesian (
granule varchar(64) not null,
SHAPE sdo_geometry NOT NULL );
insert into granule_cartesian
select granule, shape
from granule
where platform = 'ZZ'; -- granule has a few other columns...
update granule_cartesian t
set t.shape.sdo_srid = null;
insert into user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
values ( 'GRANULE_CARTESIAN', 'SHAPE',
mdsys.sdo_dim_array(
mdsys.sdo_dim_element('longitude', -180, 180, .5),
mdsys.sdo_dim_element('latitude', -90, 90, .5)),
null); -- add metadata after all rows are updated to null srid
Or, if for some reason you'd rather not insert and then update, there is another way:
insert into granule_cartesian
select granule, mdsys.sdo_geometry (t.shape.SDO_GTYPE, null, t.shape.SDO_POINT, t.shape.SDO_ELEM_INFO, t.shape.SDO_ORDINATES)
from granule t
where platform = 'ZZ'; -- granule has a few other columns...
In that case, you can have a row in the user_sdo_geom_metadata table, and even a spatial index, before you insert rows into granule_cartesian.
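As a hedged illustration of that last point, once the metadata row above exists the spatial index can be created up front (the index name is made up here).
CREATE INDEX granule_cartesian_sidx
    ON granule_cartesian (shape)
    INDEXTYPE IS MDSYS.SPATIAL_INDEX;  -- R-tree domain index on the cartesian layer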
hth. good luck.

ON DELETE CASCADE is very slow

I am using Postgres 8.4. My system is Windows 7 32-bit with 4 GB RAM and a 2.5 GHz CPU.
I have a database in Postgres with 10 tables t1, t2, t3, t4, t5.....t10.
t1 has a primary key, a sequence id, which is referenced as a foreign key by all the other tables.
Data is inserted into the database (i.e. into all tables): apart from t1, every table has nearly 50,000 rows of data, while t1 has just 1 row whose primary key is referenced from all the other tables. Then I insert a 2nd row into t1 and again 50,000 rows with this new reference into the other tables.
The issue is when I want to delete all the data entries that are present in other tables:
delete from t1 where column1='1'
This query takes nearly 10 min to execute.
I also created indexes and tried again, but the performance is not improving at all.
What can be done?
I have mentioned a sample schema below
CREATE TABLE t1
(
c1 numeric(9,0) NOT NULL,
c2 character varying(256) NOT NULL,
c3ver numeric(4,0) NOT NULL,
dmlastupdatedate timestamp with time zone NOT NULL,
CONSTRAINT t1_pkey PRIMARY KEY (c1),
CONSTRAINT t1_c1_c2_key UNIQUE (c2)
);
CREATE TABLE t2
(
c1 character varying(100),
c2 character varying(100),
c3 numeric(9,0) NOT NULL,
c4 numeric(9,0) NOT NULL,
tver numeric(4,0) NOT NULL,
dmlastupdatedate timestamp with time zone NOT NULL,
CONSTRAINT t2_pkey PRIMARY KEY (c3),
CONSTRAINT t2_fk FOREIGN KEY (c4)
REFERENCES t1 (c1) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT t2_c3_c4_key UNIQUE (c3, c4)
);
CREATE INDEX t2_index ON t2 USING btree (c4);
Let me know if there is anything wrong with the schema.
With bigger tables and more than just two or three values, you need an index on the referenced column (t1.c1) as well as the referencing columns (t2.c4, ...).
But if your description is accurate, that can not be the cause of the performance problem in your scenario. Since you have only 2 distinct values in t1, there is just no use for an index. A sequential scan will be faster.
Anyway, I re-enacted what you describe in Postgres 9.1.9
CREATE TABLE t1
( c1 numeric(9,0) PRIMARY KEY,
c2 character varying(256) NOT NULL,
c3ver numeric(4,0) NOT NULL,
dmlastupdatedate timestamptz NOT NULL,
CONSTRAINT t1_uni_key UNIQUE (c2)
);
CREATE TABLE t2
( c1 character varying(100),
c2 character varying(100),
c3 numeric(9,0) PRIMARY KEY,
c4 numeric(9,0) NOT NULL,
tver numeric(4,0) NOT NULL,
dmlastupdatedate timestamptz NOT NULL,
CONSTRAINT t2_uni_key UNIQUE (c3, c4),
CONSTRAINT t2_c4_fk FOREIGN KEY (c4)
REFERENCES t1(c1) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
);
INSERT INTO t1 VALUES
(1,'OZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf', 234, now())
,(2,'agdsOZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf', 4564, now());
INSERT INTO t2
SELECT'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
,'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
, g, 2, 456, now()
from generate_series (1, 50000) g;
INSERT INTO t2
SELECT'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
,'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
, g, 2, 789, now()
from generate_series (50001, 100000) g;
ANALYZE t1;
ANALYZE t2;
EXPLAIN ANALYZE DELETE FROM t1 WHERE c1 = 1;
Total runtime: 53.745 ms
DELETE FROM t1 WHERE c1 = 1;
58 ms execution time.
Ergo, there is nothing fundamentally wrong with your schema layout.
Minor enhancements:
You have a couple of columns defined numeric(9,0) or numeric(4,0). Unless you have a good reason to do that, you are probably a lot better off using just integer. They are smaller and faster overall. You can always add a check constraint if you really need to enforce a maximum.
I would also use text instead of varchar(n).
And reorder the columns (at table creation time). As a rule of thumb, place fixed-length NOT NULL columns first: put timestamp and integer columns first and numeric or text columns last. More here.
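Putting those suggestions together, a hedged sketch of the refactored pair of tables (column names kept from the question; the exact column order and constraint placement are just one reasonable choice):
CREATE TABLE t1
( c1               integer PRIMARY KEY
, dmlastupdatedate timestamptz NOT NULL
, c3ver            integer NOT NULL
, c2               text NOT NULL UNIQUE
);
CREATE TABLE t2
( c3               integer PRIMARY KEY
, c4               integer NOT NULL REFERENCES t1 (c1) ON DELETE CASCADE
, dmlastupdatedate timestamptz NOT NULL
, tver             integer NOT NULL
, c1               text
, c2               text
, UNIQUE (c3, c4)
);
CREATE INDEX t2_c4_idx ON t2 (c4);  -- index on the referencing column speeds up the cascade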
