What do "ReverseTransform" and "MergeTreeReverse" mean in the ClickHouse EXPLAIN PIPELINE output?
Table:
CREATE TABLE device_info
(
`event_time` DateTime,
`create_time` DateTime DEFAULT now(),
`product_id` String,
`platform` String,
`app_version` String,
`sdk_version` String,
`os_version` String,
`model` String,
`device_id` String,
`device_type` String,
`device_cost` Int64,
`data_id` String
)
ENGINE = ReplicatedMergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (event_time, product_id, platform, app_version, sdk_version, os_version, model)
TTL event_time + toIntervalDay(7)
SETTINGS index_granularity = 8192
SELECT
event_time AS eventTime,
launch_cost AS launchCost,
device_id,
data_id
FROM device_info
WHERE (device_id = 'xxxxxxx') AND (product_id = 'xxxxxxx') AND (device_type IN ('type1')) AND (event_time >= 'xxxx') AND (event_time <= 'xxxxx')
ORDER BY event_time DESC
explain pipeline
What do "ReverseTransform" and "MergeTreeReverse" mean in the EXPLAIN PIPELINE output?
In the trace log I also found this statement:
"MergingSortedTransform: Merge sorted 645 blocks,861 rows in 5.927999977 sec."
What does this statement mean?
I'd appreciate your answer, thanks!
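A rough way to picture these pipeline steps, based on my understanding of the step names rather than on the ClickHouse source: MergeTreeReverse reads each part from the end of the sorting key towards the beginning, ReverseTransform flips the rows inside each block it receives, and MergingSortedTransform merges the already sorted streams into one. The Python sketch below is only a conceptual model; the names read_part_reversed and merge_descending and the sample data are made up for illustration and are not ClickHouse APIs.

```python
import heapq

# Hypothetical data: each "part" holds event_time values stored in ascending key order.
parts = [[1, 4, 9], [2, 3, 8], [5, 6, 7]]

def read_part_reversed(part, block_size=2):
    """Yield blocks starting from the end of the part.

    Each block is sliced in stored (ascending) order and then reversed,
    so the rows inside every yielded block are descending.
    """
    for end in range(len(part), 0, -block_size):
        block = part[max(0, end - block_size):end]
        yield block[::-1]

def merge_descending(parts, block_size=2):
    """Merge the descending row streams of all parts into one descending list."""
    streams = [
        (row for block in read_part_reversed(part, block_size) for row in block)
        for part in parts
    ]
    # heapq.merge performs a k-way merge; reverse=True expects descending inputs.
    return list(heapq.merge(*streams, reverse=True))

merged = merge_descending(parts)
print(merged)
```

Under this reading, the log line would mean that the merge step combined 645 blocks containing 861 rows in total and that the step took about 5.9 seconds; presumably that time includes waiting for the reads upstream.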
I had a table
CREATE TABLE StatsFull (
Timestamp Int32,
Uid String,
ErrorCode Int32,
Name String,
Version String,
Date Date MATERIALIZED toDate(Timestamp),
Time DateTime MATERIALIZED toDateTime(Timestamp)
) ENGINE = MergeTree() PARTITION BY toMonday(Date)
ORDER BY Time SETTINGS index_granularity = 8192
And I needed to get top 100 Names with unique Uids or top 100 ErrorCodes.
The obvious query is
SELECT Name, uniq(Uid) as cnt FROM StatsFull
WHERE Time > subtractDays(toDate(now()), 1)
GROUP BY Name ORDER BY cnt DESC LIMIT 100
But data was too big so I created an AggregatingMergeTree because I did not need data filtering by hour (just by date).
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree() PARTITION BY toMonday(Date)
ORDER BY
(
Date,
ProductName,
ErrorCode,
Name,
Version
) SETTINGS index_granularity = 8192 AS
SELECT
Date,
ProductName,
ErrorCode,
Name,
Version,
uniqState(Uid) AS UniqUsers
FROM
StatsFull
GROUP BY
Date,
ProductName,
ErrorCode,
Name,
Version
And my current query is:
SELECT Name FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
The query was working fine, but the number of rows per day eventually grew and now it uses too much memory. So I am looking for some optimization.
I have found the function topK(N)(column), which returns an array of the most frequent values in the specified column, but it isn't what I need.
I would suggest the following points:
where possible, prefer SimpleAggregateFunction over AggregateFunction
use uniqCombined/uniqCombined64, which "consumes several times less memory" compared with uniq
reduce the number of dimensions in the aggregated view (it looks like ProductName and Version can be omitted)
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
Name String,
ErrorCode Int32,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
ORDER BY (Date, Name, ErrorCode) AS
SELECT Date, Name, ErrorCode, uniqState(Uid) AS UniqUsers
FROM StatsFull
GROUP BY Date, Name, ErrorCode;
add extra 'heuristic' constraints to the WHERE and HAVING clauses of the resulting query
SELECT Name, uniqMerge(UniqUsers) uniqUsers
FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
AND ErrorCode = 0 /* apply any other conditions to narrow the result set as much as possible */
GROUP BY Name
HAVING uniqUsers > 12345 /* <-- 12345 is a 'heuristic' number that you estimate based on your data; a filter on an aggregate has to go in HAVING, not WHERE */
ORDER BY uniqUsers DESC LIMIT 100
use sampling
/* Raw-table */
CREATE TABLE StatsFull (
/* .. */
) ENGINE = MergeTree()
PARTITION BY toMonday(Date)
SAMPLE BY xxHash32(Uid) /* < -- */
ORDER BY (Time, xxHash32(Uid))
/* Applying sampling to the raw table can speed up short-term queries (periods of several hours, etc.) */
SELECT Name, uniq(Uid) as cnt
FROM StatsFull
SAMPLE 0.05 /* <-- */
WHERE Time > subtractHours(now(), 6) /* <-- hours-period */
GROUP BY Name
ORDER BY cnt DESC LIMIT 100
/* Aggregated-table */
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
SAMPLE BY intHash32(toInt32(Date)) /* < -- not sure that is good to choose */
ORDER BY (intHash32(toInt32(Date)), ProductName, ErrorCode, Name, Version)
AS SELECT /* .. */ FROM StatsFull GROUP BY /* .. */
/* Applying sampling to the aggregated table can speed up long-term queries (periods of several weeks, months, etc.) */
SELECT Name
FROM StatsAggregated
SAMPLE 0.1 /* < -- */
WHERE Date > subtractMonths(toDate(now()), 3) /* <-- months-period */
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
use distributed query processing. Splitting the data into several parts (shards) allows distributed processing; the distributed_group_by_no_merge query setting can give an additional performance increase.
if you need to transpose an array into rows, you could use arrayJoin
SELECT Name, arrayJoin(topK(100)(Count)) AS top100_Count FROM Stats
I'm starting to learn CH and seem to be running into dead ends while trying to improve my query speed. The table is created like this:
CREATE TABLE default.stats(
aa String,
ab String,
user_id UInt16,
ac UInt32,
ad UInt8,
ae UInt8,
created_time DateTime,
created_date Date,
af UInt8,
ag UInt32,
ah UInt32,
ai String,
aj String)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_time)
ORDER BY(created_time, user_id)
and I'm running a query like so
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
WHERE user_id = 1 AND lowerUTF8(ab) = 'xxxxxxxxx' AND ad != 12
ORDER BY created_time DESC
LIMIT 50 OFFSET 0
This is the result: 50 rows in set. Elapsed: 2.881 sec. Processed 74.62 million rows
If I run the same query without the ORDER BY part: 50 rows in set. Elapsed: 0.020 sec. Processed 49.15 thousand rows
Why does it seem to process all the rows in the table if, in theory, the query only has to order around 10k rows (all the rows returned without the limit)? What am I missing, and/or how could I improve the speed of CH?
Try ORDER BY created_time DESC, user_id
The optimize_read_in_order feature was implemented in ClickHouse release 19.14.3.3, 2019-09-10
CH 19.17.4.11:
CREATE TABLE stats
(
`aa` String,
`ab` String,
`user_id` UInt16,
`ac` UInt32,
`ad` UInt8,
`ae` UInt8,
`created_time` DateTime,
`created_date` Date,
`af` UInt8,
`ag` UInt32,
`ah` UInt32,
`ai` String,
`aj` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_time)
ORDER BY (created_time, user_id)
insert into stats(created_time, user_id) select toDateTime(intDiv(number,100)), number%103 from numbers(100000000)
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
ORDER BY created_time DESC
LIMIT 5 OFFSET 0
5 rows in set. Elapsed: 0.013 sec. Processed 835.84 thousand rows,
set optimize_read_in_order = 0
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
ORDER BY created_time DESC
LIMIT 5 OFFSET 0
5 rows in set. Elapsed: 0.263 sec. Processed 100.00 million rows
Compare the results with
set optimize_read_in_order = 0 vs. set optimize_read_in_order = 1
I don't understand why optimize_read_in_order is not working in your case.
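The difference between the two runs above can be modelled outside ClickHouse. The sketch below is a simplified illustration in Python, not ClickHouse code: it assumes the rows are already stored in created_time order (as the ORDER BY key of the table implies) and compares a full sort with reading from the end and stopping after LIMIT rows. The function names and the row count are made up for the example.

```python
from itertools import islice

# Hypothetical stand-in for the table: created_time values stored in ascending key order.
stored_rows = list(range(100_000))

def top_n_full_sort(rows, n):
    """Models optimize_read_in_order = 0: touch every row, sort, then cut to n."""
    rows_touched = len(rows)
    return sorted(rows, reverse=True)[:n], rows_touched

def top_n_read_in_order(rows, n):
    """Models optimize_read_in_order = 1: walk backwards and stop after n rows."""
    result = list(islice(reversed(rows), n))
    return result, len(result)

full_result, full_touched = top_n_full_sort(stored_rows, 5)
ordered_result, ordered_touched = top_n_read_in_order(stored_rows, 5)

# Same answer, very different amount of data touched.
print(full_result == ordered_result, full_touched, ordered_touched)
```

A real server reads whole granules and several parts in parallel, which is presumably why the run above reports 835.84 thousand rows rather than 5.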
I have a table with 3 billion rows. When I run a query like this
Select * from tsnew where time > 971128806382 and time <971172006000
limit 100
it works fine and takes 0.2 seconds.
But when I add an ORDER BY to the query, like this:
Select * from tsnew where time > 971128806382 and time <971172006000
order by time desc
limit 100
it takes a very long time (more than 20 seconds).
create table tsnew(
ext_rec_num Nullable(UInt64),
xdr_id Nullable(UInt64),
xdr_grp Nullable(UInt64),
xdr_type Nullable(UInt64),
xdr_subtype Nullable(Int16),
xdr_direction Nullable(Int16),
xdr_location Nullable(Int16),
time UInt64,
stop_time UInt64,
transaction_duration Nullable(UInt64),
response_time Nullable(UInt64),
protocol Nullable(Int16),
chunk_count Nullable(Int16),
dpc Nullable(Int32),
opc Nullable(Int32),
first_link_id String,
last_dpc Nullable(Int32),
last_opc Nullable(Int32),
last_link_id String,
first_back_opc Nullable(Int32),
first_back_link_id String,
calling_ssn Nullable(Int16),
called_ssn Nullable(Int16),
called_sccp_address String,
calling_party_address String,
response_calling_address String,
root_end_code Nullable(Int32),
root_cause_code Nullable(Int32),
root_cause_pl Nullable(Int16),
root_failure Nullable(Int16),
root_equip Nullable(Int16)
)
ENGINE = MergeTree()
PARTITION BY toInt64(time/3600000)*3600000
order by time
SETTINGS index_granularity = 8192
Can anyone help me with this?
It's a known issue. Hopefully the fix will be merged soon. Subscribe to the PR and upgrade your CH once the PR is merged.
I have a SQLite DB with a single table and 60,000,000 records. A simple query takes more than 100 seconds to run.
I tried switching to PostgreSQL, but its performance was even worse.
I haven't tested it on MySQL or MSSQL.
Should I split the table (say, a separate table for each pointID, of which there are a few hundred, or a separate table for each month, so that each has at most 10,000,000 records)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The following query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit:
An answer from another place suggested adding the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
This improved performance by a factor of 3.
Edit #2: as suggested by Raymond Nijland, here is the execution plan:
SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)"
"0" "0" "0" "EXECUTE LIST SUBQUERY 1"
"0" "0" "0" "USE TEMP B-TREE FOR ORDER BY"
Thanks to him, and using this plan, I changed the ordering clauses in the query as follows:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me it's solved).
After Raymond Nijland suggested checking the execution plan, I changed the query to the nested form shown above: the inner query keeps "order by id desc limit 5000" and the outer query applies "order by timestamp desc".
This query gives the same results as the original one on my data, but it is 120 times faster, because it reduces the number of records before sorting. Note that this relies on id growing together with timestamp; otherwise the inner limit could pick different rows.
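To see why the ORDER BY was expensive, it can help to compare EXPLAIN QUERY PLAN before and after adding an index that matches both the filter and the sort. The sketch below uses Python's sqlite3 module with a cut-down copy of the schema and made-up sample rows; the index idx_trend_time is a hypothetical shape chosen for this simplified query, not the idx_All index from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collectedData ("
    " id INTEGER PRIMARY KEY, timeStamp DOUBLE, pointID NVARCHAR,"
    " trendNumber INTEGER, status NVARCHAR, value DOUBLE)"
)
# Hypothetical sample rows: 1000 readings spread over 10 points.
conn.executemany(
    "INSERT INTO collectedData (timeStamp, pointID, trendNumber, status, value)"
    " VALUES (?, ?, 1, 'ok', ?)",
    [(1556790000 + i, f"point{i % 10}", float(i)) for i in range(1000)],
)

QUERY = (
    "SELECT id, timeStamp FROM collectedData"
    " WHERE trendNumber = 1 AND timeStamp <= 1556793244"
    " ORDER BY timeStamp DESC, id DESC LIMIT 5"
)

def plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(QUERY)
rows_before = conn.execute(QUERY).fetchall()

# Assumed index shape: the equality column first, then the ORDER BY columns
# in the same direction as the query, so rows come out already sorted.
conn.execute(
    "CREATE INDEX idx_trend_time"
    " ON collectedData (trendNumber, timeStamp DESC, id DESC)"
)
plan_after = plan(QUERY)
rows_after = conn.execute(QUERY).fetchall()

print(plan_before)  # without the index a temporary sorter step is listed
print(plan_after)   # with the index the search should go through idx_trend_time
```

If the second plan no longer lists a "USE TEMP B-TREE FOR ORDER BY" step, the rows are being read in index order and the LIMIT can stop early, without the rewrite that depends on id order.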
I run the code below every day and it has succeeded every time. But when I ran it today, it failed with the error IndexOutOfBoundsException Index: 3, Size: 3. When I deleted the clause "where member_srl is not null", it succeeded. So I don't know what is wrong with my code, or why it cannot run today. Thanks.
select member_srl, dt, sessionid , (max(reg_time)-min(reg_time)) as duration, count(reg_time) as click_cnt
from
(
select cast(member_srl as bigint) as member_srl, reg_date as dt, sessionid, cast(SUBSTRING(reg_time,1,2)*3600+SUBSTRING(reg_time,3,2)*60+SUBSTRING(reg_time,5,2) as bigint) as reg_time
from default.daily_session
where member_srl<>'' and dt = '20161009'
union all
select cast(member_srl as bigint) as member_srl, reg_date as dt, sessionid, cast(SUBSTRING(reg_time,1,2)*3600+SUBSTRING(reg_time,3,2)*60+SUBSTRING(reg_time,5,2) as bigint) as reg_time
from default.daily_session_mobile
where member_srl<>'' and dt = '20161009'
union all
select cast(member_srl as bigint) as member_srl, reg_date as dt, sessionid, cast(SUBSTRING(reg_time,1,2)*3600+SUBSTRING(reg_time,3,2)*60+SUBSTRING(reg_time,5,2) as bigint) as reg_time
from default.daily_session_ios
where member_srl<>'' and dt = '20161009'
) base where member_srl is not null
group by member_srl, dt, sessionid
It is a bug, and it has been fixed in version 1.3.0.
If you are using an old version like me, then set hive.optimize.ppd=false; to work around the problem. (deprecated)
My version is Hive 1.1.0-cdh5.6.0
I set hive.optimize.ppd to false and found that filters on the partition field no longer work, so all partitions of the partitioned table are read by the mappers.
Finally, I changed my query: adding one more subquery solved the problem.