Issue with SELECT top N and ORDER BY primary key in ClickHouse

I have a table with 3 billion rows. When I run a query like this:
SELECT * FROM tsnew WHERE time > 971128806382 AND time < 971172006000
LIMIT 100
it works fine and takes 0.2 seconds. But when I add an ORDER BY to the query:
SELECT * FROM tsnew WHERE time > 971128806382 AND time < 971172006000
ORDER BY time DESC
LIMIT 100
it takes a very long time (more than 20 seconds). The table is defined as:
CREATE TABLE tsnew (
ext_rec_num Nullable(UInt64),
xdr_id Nullable(UInt64),
xdr_grp Nullable(UInt64),
xdr_type Nullable(UInt64),
xdr_subtype Nullable(Int16),
xdr_direction Nullable(Int16),
xdr_location Nullable(Int16),
time UInt64,
stop_time UInt64,
transaction_duration Nullable(UInt64),
response_time Nullable(UInt64),
protocol Nullable(Int16),
chunk_count Nullable(Int16),
dpc Nullable(Int32),
opc Nullable(Int32),
first_link_id String,
last_dpc Nullable(Int32),
last_opc Nullable(Int32),
last_link_id String,
first_back_opc Nullable(Int32),
first_back_link_id String,
calling_ssn Nullable(Int16),
called_ssn Nullable(Int16),
called_sccp_address String,
calling_party_address String,
response_calling_address String,
root_end_code Nullable(Int32),
root_cause_code Nullable(Int32),
root_cause_pl Nullable(Int16),
root_failure Nullable(Int16),
root_equip Nullable(Int16)
)
ENGINE = MergeTree()
PARTITION BY toInt64(time/3600000)*3600000
ORDER BY time
SETTINGS index_granularity = 8192
Can anyone help me with this?

It's a known issue. Hopefully the fix will be merged soon. Subscribe to the PR and upgrade your ClickHouse once the PR is merged.
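A quick way to test whether this is related to the read-in-order optimization (an assumption on my part; the PR itself is not identified in the answer) is to compare timings with the setting toggled off and on:
SET optimize_read_in_order = 0;
SELECT * FROM tsnew WHERE time > 971128806382 AND time < 971172006000
ORDER BY time DESC
LIMIT 100;
SET optimize_read_in_order = 1;
SELECT * FROM tsnew WHERE time > 971128806382 AND time < 971172006000
ORDER BY time DESC
LIMIT 100;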

Related

What do "ReverseTransform" and "MergeTreeReverse" mean in the ClickHouse EXPLAIN PIPELINE?

The table:
CREATE TABLE device_info
(
`event_time` DateTime,
`create_time` DateTime DEFAULT now(),
`product_id` String,
`platform` String,
`app_version` String,
`sdk_version` String,
`os_version` String,
`model` String,
`device_id` String,
`device_type` String,
`device_cost` Int64,
`data_id` String
)
ENGINE = ReplicatedMergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (event_time, product_id, platform, app_version, sdk_version, os_version, model)
TTL event_time + toIntervalDay(7)
SETTINGS index_granularity = 8192
SELECT
event_time AS eventTime,
launch_cost AS launchCost,
device_id,
data_id
FROM device_info
WHERE (device_id = 'xxxxxxx') AND (product_id = 'xxxxxxx') AND (device_type IN ('type1')) AND (event_time >= 'xxxx') AND (event_time <= 'xxxxx')
ORDER BY event_time DESC
What do "ReverseTransform" and "MergeTreeReverse" mean in the EXPLAIN PIPELINE output for this query?
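For reference, a pipeline like this is produced by prefixing the query with EXPLAIN PIPELINE (a sketch reconstructed from the query above with a shortened column list; the post did not reproduce the output itself):
EXPLAIN PIPELINE
SELECT event_time, device_id, data_id
FROM device_info
WHERE device_id = 'xxxxxxx' AND product_id = 'xxxxxxx'
ORDER BY event_time DESC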
In the trace log I also found the statement "MergingSortedTransform: Merge sorted 645 blocks, 861 rows in 5.927999977 sec." What does this statement mean?
I'd appreciate your answers, thanks!

Get top N rows without ORDER BY operator in ClickHouse

I had a table
CREATE TABLE StatsFull (
Timestamp Int32,
Uid String,
ErrorCode Int32,
Name String,
Version String,
Date Date MATERIALIZED toDate(Timestamp),
Time DateTime MATERIALIZED toDateTime(Timestamp)
) ENGINE = MergeTree() PARTITION BY toMonday(Date)
ORDER BY Time SETTINGS index_granularity = 8192
And I needed to get the top 100 Names by unique Uid count, or the top 100 ErrorCodes.
The obvious query is
SELECT Name, uniq(Uid) AS cnt FROM StatsFull
WHERE Time > subtractDays(toDate(now()), 1)
GROUP BY Name ORDER BY cnt DESC LIMIT 100
But the data was too big, so I created an AggregatingMergeTree, because I did not need to filter data by hour (just by date).
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree() PARTITION BY toMonday(Date)
ORDER BY
(
Date,
ProductName,
ErrorCode,
Name,
Version
) SETTINGS index_granularity = 8192 AS
SELECT
Date,
ProductName,
ErrorCode,
Name,
Version,
uniqState(Uid) AS UniqUsers
FROM
StatsFull
GROUP BY
Date,
ProductName,
ErrorCode,
Name,
Version
And my current query is:
SELECT Name FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
The query was working fine; however, the number of rows per day eventually grew, and the query is now too memory-hungry. So I am looking for some optimization.
I have found the function topK(N)(column), which returns an array of the most frequent values in the specified column, but it isn't what I need.
I would suggest the following points:
where possible, prefer SimpleAggregateFunction over AggregateFunction
use uniqCombined/uniqCombined64, which "consumes several times less memory" compared with uniq; for example:
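A minimal sketch of that swap (assuming the same materialized-view layout as above; the -State and -Merge combinators apply to uniqCombined exactly as they do to uniq):
UniqUsers AggregateFunction(uniqCombined, String)
/* populated with: */
uniqCombinedState(Uid) AS UniqUsers
/* and read back with: */
uniqCombinedMerge(UniqUsers)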
reduce the number of dimensions in the aggregated view (it looks like ProductName and Version can be omitted):
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
Name String,
ErrorCode Int32,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
ORDER BY (Date, Name, ErrorCode) AS
SELECT Date, Name, ErrorCode, uniqState(Uid) AS UniqUsers
FROM StatsFull
GROUP BY Date, Name, ErrorCode;
add extra 'heuristic' constraints to the WHERE/HAVING clauses of the resulting query:
SELECT Name, uniqMerge(UniqUsers) uniqUsers
FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
AND ErrorCode = 0 /* apply any other conditions to narrow the result set as much as possible */
GROUP BY Name
HAVING uniqUsers > 12345 /* <-- 12345 is a 'heuristic' number that you evaluate based on your data */
ORDER BY uniqUsers DESC LIMIT 100
use sampling
/* Raw-table */
CREATE TABLE StatsFull (
/* .. */
) ENGINE = MergeTree()
PARTITION BY toMonday(Date)
SAMPLE BY xxHash32(Uid) /* <-- */
ORDER BY (Time, xxHash32(Uid))
/* Applying sampling to the raw table can speed up short-term queries (periods of several hours, etc.) */
SELECT Name, uniq(Uid) AS cnt
FROM StatsFull
SAMPLE 0.05 /* <-- */
WHERE Time > subtractHours(now(), 6) /* <-- hours-period */
GROUP BY Name
ORDER BY cnt DESC LIMIT 100
/* Aggregated-table */
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
SAMPLE BY intHash32(toInt32(Date)) /* <-- not sure this is a good choice */
ORDER BY (intHash32(toInt32(Date)), ProductName, ErrorCode, Name, Version) AS
SELECT /* .. */ FROM StatsFull GROUP BY /* .. */
/* Applying sampling to the aggregated table can speed up long-term queries (periods of several weeks, months, etc.) */
SELECT Name
FROM StatsAggregated
SAMPLE 0.1 /* <-- */
WHERE Date > subtractMonths(toDate(now()), 3) /* <-- months-period */
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
use distributed query processing. Splitting the data into several shards enables distributed processing; the distributed_group_by_no_merge query setting gives an additional increase in processing performance. For example:
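A sketch of that setup (the cluster name 'my_cluster' and database 'default' are placeholders, not from the original answer; the Distributed engine takes cluster, database, table, and a sharding key):
CREATE TABLE StatsAggregatedDist AS StatsAggregated
ENGINE = Distributed('my_cluster', 'default', 'StatsAggregated', rand());
SELECT Name, uniqMerge(UniqUsers) uniqUsers
FROM StatsAggregatedDist
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqUsers DESC LIMIT 100
SETTINGS distributed_group_by_no_merge = 1
/* note: with distributed_group_by_no_merge = 1 each shard returns its own partial result, so the top 100 is only exact if a given Name never spans shards */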
if you need to transpose an array into rows, you can use arrayJoin:
SELECT Name, arrayJoin(topK(100)(Count)) AS top100_Count FROM Stats GROUP BY Name

ClickHouse MergeTree slow SELECT with ORDER BY

I'm starting to learn CH and seem to be running into dead ends while trying to improve my query speed. The table is created like this:
CREATE TABLE default.stats(
aa String,
ab String,
user_id UInt16,
ac UInt32,
ad UInt8,
ae UInt8,
created_time DateTime,
created_date Date,
af UInt8,
ag UInt32,
ah UInt32,
ai String,
aj String)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_time)
ORDER BY (created_time, user_id)
and I'm running a query like so
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
WHERE user_id = 1 AND lowerUTF8(ab) = 'xxxxxxxxx' AND ad != 12
ORDER BY created_time DESC
LIMIT 50 OFFSET 0
This is the result: 50 rows in set. Elapsed: 2.881 sec. Processed 74.62 million rows.
If I run the same query without the ORDER BY part: 50 rows in set. Elapsed: 0.020 sec. Processed 49.15 thousand rows.
Why does it seem to process all the rows in the table when, in theory, the query only has to order around 10k rows (all the rows returned without the limit)? What am I missing, and/or how could I improve the speed of CH?
try ORDER BY created_time DESC, user_id
The optimize_read_in_order feature was implemented in ClickHouse release 19.14.3.3 (2019-09-10).
Testing on CH 19.17.4.11:
CREATE TABLE stats
(
`aa` String,
`ab` String,
`user_id` UInt16,
`ac` UInt32,
`ad` UInt8,
`ae` UInt8,
`created_time` DateTime,
`created_date` Date,
`af` UInt8,
`ag` UInt32,
`ah` UInt32,
`ai` String,
`aj` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_time)
ORDER BY (created_time, user_id)
insert into stats(created_time, user_id) select toDateTime(intDiv(number,100)), number%103 from numbers(100000000)
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
ORDER BY created_time DESC
LIMIT 5 OFFSET 0
5 rows in set. Elapsed: 0.013 sec. Processed 835.84 thousand rows.
set optimize_read_in_order = 0
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
ORDER BY created_time DESC
LIMIT 5 OFFSET 0
5 rows in set. Elapsed: 0.263 sec. Processed 100.00 million rows
Check the difference:
set optimize_read_in_order = 0 vs. set optimize_read_in_order = 1
I don't understand why optimize_read_in_order is not working in your case.
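One thing worth checking on the server in question (a sketch; system.settings reflects the current session's value):
SELECT name, value
FROM system.settings
WHERE name = 'optimize_read_in_order'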

Frequency histogram in ClickHouse with unique and non-unique data

I have an event table with created_at (DateTime), userid (String), and eventid (String) columns. Here userid can repeat, while eventid is always a unique UUID.
I am looking to build both unique and non-unique frequency histograms.
This is for both eventid and userid, based on three given inputs:
start_datetime
end_datetime and
interval (1 min, 1 hr, 1 day, 7 day, 1 month).
Here, the number of buckets is decided by (end_datetime - start_datetime) / interval.
The output comes as start_datetime, end_datetime, and frequency.
For any interval where data is not available, start_datetime and end_datetime should still be returned, but with a frequency of 0.
How can I build a generic query for this?
I looked into the histogram function but could not find any documentation for it. While trying it, I could not understand the relation between its input and output.
count(distinct XXX) is deprecated.
uniq(XXX) or uniqExact(XXX) are more useful.
I got it to work using the following. Here, toStartOfMonth can be changed to other similar functions in CH (a generic-interval variant is sketched after the two queries).
select toStartOfMonth(`timestamp`) interval_data , count(distinct uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
and
select toStartOfMonth(`timestamp`) interval_data , count(*) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
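The generic-interval variant mentioned above could look like this (a sketch, not from the original post: toStartOfInterval generalizes toStartOfMonth; empty buckets still come back missing rather than as 0, so they would have to be filled in afterwards, e.g. with ORDER BY ... WITH FILL or on the client side):
select toStartOfInterval(`timestamp`, INTERVAL 1 hour) interval_data, uniq(uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data
ORDER BY interval_data;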
But performance is very low for >2 billion records per month in the event table, where toYYYYMM(timestamp) is the partition key and toYYYYMMDD(timestamp) is the ORDER BY key.
The distinct count query takes >30 GB of space and 30 seconds, yet didn't complete, while the general count query takes 10-20 seconds.

How can I improve performance of SQLite with a large table?

SQLite DB with a single table and 60,000,000 records. Time to run a simple query is more than 100 seconds.
I've tried switching to PostgreSQL but its performance was even worse.
I haven't tested MySQL or MSSQL.
Should I split the table (say, a different table for each pointID - there are some hundreds of them - or a different table for each month, so I'd have a maximum of 10,000,000 records per table)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The next query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit: an answer from another place suggested adding the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
It improved performance by a factor of 3.
Edit #2: following @Raymond Nijland's suggestion, the execution plan is:
SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)"
"0" "0" "0" "EXECUTE LIST SUBQUERY 1"
"0" "0" "0" "USE TEMP B-TREE FOR ORDER BY"
Thanks to him - using this data, I've changed the query to the following:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me, it's solved).
After @Raymond Nijland suggested checking the execution plan, I changed the query as shown above. It gives the same results as the original query but is 120 times faster (it decreases the number of records before sorting).
