I want to know what's the best practice for the partition key.
In my project, we have a table with event_date, app_id and other columns. The app_id will be growing and could be thousands.
The select query is based on event_date and app_id.
The simple data schema is as below:
CREATE TABLE test.test_custom_partition (
company_id UInt64,
app_id String,
event_date DateTime,
event_name String ) ENGINE MergeTree()
PARTITION BY (toYYYYMMDD(event_date), app_id)
ORDER BY (app_id, company_id, event_date)
SETTINGS index_granularity = 8192;
the select query is like below:
select event_name from test_custom_partition
where event_date >= '2020-07-01 00:00:00' AND event_date <= '2020-07-15 00:00:00'
AND app_id = 'test';
I want to use (toYYYYMMDD(event_date), app_id) as the partition key, as the query could read the minimal data parts. But it could cause the partitions more than 1000, from the document I see
A merge only works for data parts that have the same value for the
partitioning expression. This means you shouldn't make overly granular
partitions (more than about a thousand partitions). Otherwise, the
SELECT query performs poorly because of an unreasonably large number
of files in the file system and open file descriptors.
Or should I use the partition key only toYYYYMMDD(event_date)?
also, could anyone explain why the partition shouldn't more than 1000 partitions? even if the query only use a small set of the data part, it still could cause performance issue?
Thanks
Related
I have an oracle table. Table's DDL is (not have the primary key)
create table CLIENT_ACCOUNT
(
CLIENT_ID VARCHAR2(18) default ' ' not null,
ACCOUNT_ID VARCHAR2(18) default ' ' not null,
......
)
create unique index UK_ACCOUNT
on CLIENT_ACCOUNT (CLIENT_ID, ACCOUNT_ID)
Then, the data's scale is very huge, maybe 100M records. I want to traverse this whole table's data with batch.
Now, I use the table's index to batch traverse. But I have some oracle grammar problems.
# I want to use this SQL, but grammar error.
# try to use b-tree's index to locate start position, but not work
select * from CLIENT_ACCOUNT
WHERE (CLIENT_ID, ACCOUNT_ID) > (1,2)
AND ROWNUM < 1000
ORDER BY CLIENT_ID, ACCOUNT_ID
Has the fastest way to batch touch table data?
Wild guess:
select * from CLIENT_ACCOUNT
WHERE CLIENT_ID > '1'
and ACCOUNT_ID > '2'
AND ROWNUM < 1000;
It would at least compile, although whether it correctly implements your business logic is a different matter. Note that I have cast your filter criteria to strings. This is because your columns have a string datatype and you are defaulting them to spaces, so there's a high probability those columns contain non-numeric values.
If this doesn't solve your problem, please edit your question with more details; sample input data and expected output is always helpful in these situations.
Your data model seems odd.
Your columns are defined as varchar2. So why is your criteria numeric?
Also, why do you default the key columns to space? It would be better to leave unpopulated values as null. (To be clear, NULL is not a good thing in an indexed column, it's just better than a space.)
I created a table and two materialized views recursively.
Table:
CREATE TABLE `log_details` (
date String,
event_time DateTime,
username String,
city String)
ENGINE = MergeTree()
ORDER BY (date, event_time)
PARTITION BY date TTL event_time + INTERVAL 1 MONTH
Materialized views:
CREATE MATERIALIZED VIEW `log_u_c_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username, city)
AS
SELECT date, username, city, count() as times
FROM `log_details`
GROUP BY date, username, city
CREATE MATERIALIZED VIEW `log_u_day_mv`
ENGINE = SummingMergeTree()
PARTITION BY date
ORDER BY (date, username)
AS
SELECT date, username, SUM(times) as total_times
FROM `.inner.log_u_c_day_mv`
GROUP BY date, username
Insert into log_details → Insert into log_u_c_day_mv → Insert into log_u_day_mv.
log_u_day_mv is not be optimized after 15 minutes inserting log_u_c_day_mv even over one day.
I tried to optimize log_u_day_mv manually and it works.
OPTIMIZE TABLE `.inner.log_u_day_mv` PARTITION 20210110
But ClickHouse does not timely optimize it.
How to solve it?
Data always is not fully aggregated/collapsed in MT.
If you do optimize final the next insert into creates a new part.
CH does not merge parts by time. Merge scheduler selects parts by own algorithm based on the current node workload / number of parts / size of parts.
SummingMT MUST BE QUERIED with sum / groupby ALWAYS.
select sum(times), username
from log_u_day_mv
group by username
DO NOT USE from log_u_day_mv FINAL it reads excessive columns!!!!!!!!!!!!!!
I'm trying to migrate one of my Postgres tables at ClickHouse. Here what I came up with at ClickHouse:
CREATE TABLE loads(
country_id UInt16,
partner_id UInt32,
is_unique UInt8,
ip String,
created_at DateTime
) ENGINE=MergeTree PARTITION BY toYYYYMM(created_at) ORDER BY (created_at);
is_unique here is a Boolean with 0 or 1. I wanna know count for aggregates: country_id, partner_id and created_at, but also I wanna know how much from these loads are unique loads. At Postgres it looks like:
SELECT
count(*) AS loads,
count(*) FILTER (WHERE is_unique) AS uniq,
country_id,
partner_id,
created_at::date AS ts
FROM loads
GROUP BY ts, country_id, partner_id
Is it possible at ClickHouse or should I think again about how to aggregate the data? I didn't find any clues at manual except count can get expr instead of asterisk, but count(is_unique = 1) doesn't work and just returns the same amount as count(*).
I just found an answer in minutes after posting:
SELECT count(*), countIf(is_unique = 1) /* .. */
Good luck.
Table myfirst3 have 4 columns and 1.2 million records.
Table mtl_object_genealogy has over 10 million records.
Running the below code takes very long time. How to tune this code using with options?
WITH level1 as (
SELECT mln_parent.lot_number,
mln_parent.inventory_item_id,
gen.lot_num ,--fg_lot,
gen.segment1,
gen.rcv_date.
FROM mtl_lot_numbers mln_parent,
(SELECT MOG1.parent_object_id,
p.segment1,
p.lot_num,
p.rcv_date
FROM mtl_object_genealogy MOG1 ,
myfirst3 p
START WITH MOG1.object_id = p.gen_object_id
AND (MOG1.end_date_active IS NULL OR MOG1.end_date_active > SYSDATE)
CONNECT BY nocycle PRIOR MOG1.parent_object_id = MOG1.object_id
AND (MOG1.end_date_active IS NULL OR MOG1.end_date_active > SYSDATE)
UNION all
SELECT p1.gen_object_id,
p1.segment1,
p1.lot_num,
p1.rcv_date
FROM myfirst3 p1 ) gen
WHERE mln_parent.gen_object_id = gen.parent_object_id )
select /*+ NO_CPU_COSTING */ *
from level1;
execution plan
CREATE TABLE APPS.MYFIRST3
(
TO_ORGANIZATION_ID NUMBER,
LOT_NUM VARCHAR2(80 BYTE),
ITEM_ID NUMBER,
FROM_ORGANIZATION_ID NUMBER,
GEN_OBJECT_ID NUMBER,
SEGMENT1 VARCHAR2(40 BYTE),
RCV_DATE DATE
);
CREATE TABLE INV.MTL_OBJECT_GENEALOGY
(
OBJECT_ID NUMBER NOT NULL,
OBJECT_TYPE NUMBER NOT NULL,
PARENT_OBJECT_ID NUMBER NOT NULL,
START_DATE_ACTIVE DATE NOT NULL,
END_DATE_ACTIVE DATE,
GENEALOGY_ORIGIN NUMBER,
ORIGIN_TXN_ID NUMBER,
GENEALOGY_TYPE NUMBER,
);
CREATE INDEX INV.MTL_OBJECT_GENEALOGY_N1 ON INV.MTL_OBJECT_GENEALOGY(OBJECT_ID);
CREATE INDEX INV.MTL_OBJECT_GENEALOGY_N2 ON INV.MTL_OBJECT_GENEALOGY(PARENT_OBJECT_ID);
Your explain plan shows some very big numbers. The optimizer reckons the final result set will be about 3227,000,000,000 rows. Just returning that many rows will take some time.
All table accesses are Full Table Scans. As you have big tables that will eat time too.
As for improvements, it's pretty hard to for us understand the logic of your query. This is your data model, you business rules, your data. You haven't explained anything so all we can do is guess.
Why are you using the WITH clause? You only use the level result set once, so just have a regular FROM clause.
Why are you using UNION ALL? That operation just duplicates the records retrieved from myfirst3 ( all those values are already included as rows where MOG1.object_id = p.gen_object_id.
The MERGE JOIN CARTESIAN operation is interesting. Oracle uses it to implement transitive closure. It is an expensive operation but that's because treewalking a hierarchy is an expensive thing to do. It is unfortunate for you that you are generating all the parent-child relationships for a table with 27 million records. That's bad.
The full table scans aren't the problem. There are no filters on myfirst3 so obviously the database has to get all the records. If there is one parent for each myfirst3 record that's 10% of the contents mtl_object_genealogy so a full table scan would be efficient; but you're rolling up the entire hierarchy so it's like you're looking at a much greater chunk of the table.
Your indexes are irrelevant in the face of such numbers. What might help is a composite index on mtl_object_genealogy(OBJECT_ID, PARENT_OBJECT_ID, END_DATE_ACTIVE).
You want all the levels of PARENT_OBJECT_ID for the records in myfirst3. If you run this query often and mtl_object_genealogy is a slowly changing table you should consider materializing the transitive closure into a table which just has records for all the permutations of leaf records and parents.
To sum up:
Ditch the WITH clause
Drop the UNION ALL
Tune the tree-walk with a composite index (or materializing it)
Need help query performance.
I have a table A joining to a view and it is taking 7 seconds to get the results. But when i do select query on view i get the results in 1 seconds.
I have created the indexes on the table A. But there is no improvements in the query.
SELECT
ITEM_ID, BARCODE, CONTENT_TYPE_CODE, DEPARTMENT, DESCRIPTION, ITEM_NUMBER, FROM_DATE,
TO_DATE, CONTACT_NAME, FILE_LOCATION, FILE_LOCATION_UPPER, SOURCE_LOCATION,
DESTRUCTION_DATE, SOURCE, LABEL_NAME, ARTIST_NAME, TITLE, SELECTION_NUM, REP_IDENTIFIER,
CHECKED_OUT
FROM View B,
table A
where B.item_id=A.itemid
and status='VALID'
AND session_id IN ('naveen13122016095800')
ORDER BY item_id,barcode;
CREATE TABLE A
(
ITEMID NUMBER,
USER_NAME VARCHAR2(25 BYTE),
CREATE_DATE DATE,
SESSION_ID VARCHAR2(240 BYTE),
STATUS VARCHAR2(20 BYTE)
)
CREATE UNIQUE INDEX A_IDX1 ON A(ITEMID);
CREATE INDEX A_IDX2 ON A(SESSION_ID);
CREATE INDEX A_IDX3 ON A(STATUS);'
So querying the view joined to a table is slower than querying the view alone? This is not surprising, is it?
Anyway, it doesn't make much sense to create separate indexes on the fields. The DBMS will pick one index (if any) to access the table. You can try a composed index:
CREATE UNIQUE INDEX A_IDX4 ON A(status, session_id, itemid);
But the DBMS will still only use this index when it sees an advantage in this over simply reading the full table. That means, if the DBMS expects to have to read a big amount of records anyway, it won't indirectly access them via the index.
At last two remarks concerning your query:
Don't use those out-dated comma-separated joins. They are less readable and more prone to errors than explicit ANSI joins (FROM View B JOIN table A ON B.item_id = A.itemid).
Use qualifiers for all columns when working with more than one table or view in your query (and A.status='VALID' ...).
UPDATE: I see now, that you are not selecting any columns from the table, so why join it at all? It seems you are merely looking up whether a record exists in the table, so use EXISTS or IN accordingly. (This may not make it faster, but a lot more readable at least.)
SELECT
ITEM_ID, BARCODE, CONTENT_TYPE_CODE, DEPARTMENT, DESCRIPTION, ITEM_NUMBER, FROM_DATE,
TO_DATE, CONTACT_NAME, FILE_LOCATION, FILE_LOCATION_UPPER, SOURCE_LOCATION,
DESTRUCTION_DATE, SOURCE, LABEL_NAME, ARTIST_NAME, TITLE, SELECTION_NUM, REP_IDENTIFIER,
CHECKED_OUT
FROM View
WHERE itemid IN
(
SELECT itemid
FROM A
WHERE status = 'VALID'
AND session_id IN ('naveen13122016095800')
)
ORDER BY item_id, barcode;