I'm starting to learn ClickHouse (CH) and seem to be running into dead ends while trying to improve my query speed. The table is created like this:
CREATE TABLE default.stats(
aa String,
ab String,
user_id UInt16,
ac UInt32,
ad UInt8,
ae UInt8,
created_time DateTime,
created_date Date,
af UInt8,
ag UInt32,
ah UInt32,
ai String,
aj String)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_time)
ORDER BY(created_time, user_id)
and I'm running a query like so
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
WHERE user_id = 1 AND lowerUTF8(ab) = 'xxxxxxxxx' AND ad != 12
ORDER BY created_time DESC
LIMIT 50 OFFSET 0
This is the result: 50 rows in set. Elapsed: 2.881 sec. Processed 74.62 million rows.
If I run the same query without the ORDER BY, I get: 50 rows in set. Elapsed: 0.020 sec. Processed 49.15 thousand rows.
Why does it seem to process all the rows in the table when, in theory, the query only has to order around 10k rows (all the rows returned without the LIMIT)? What am I missing, and/or how could I improve the speed of ClickHouse?
Try ORDER BY created_time DESC, user_id.
The optimize_read_in_order feature was implemented in ClickHouse release 19.14.3.3 (2019-09-10).
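A minimal sketch of that suggestion applied to the query from the question (column names and filter values as posted; whether read-in-order actually kicks in depends on the server version and settings):
-- same filter as in the question; only the ORDER BY clause is extended with user_id
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
WHERE user_id = 1 AND lowerUTF8(ab) = 'xxxxxxxxx' AND ad != 12
ORDER BY created_time DESC, user_id
LIMIT 50 OFFSET 0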
CH 19.17.4.11
CREATE TABLE stats
(
`aa` String,
`ab` String,
`user_id` UInt16,
`ac` UInt32,
`ad` UInt8,
`ae` UInt8,
`created_time` DateTime,
`created_date` Date,
`af` UInt8,
`ag` UInt32,
`ah` UInt32,
`ai` String,
`aj` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_time)
ORDER BY (created_time, user_id)
insert into stats(created_time, user_id) select toDateTime(intDiv(number,100)), number%103 from numbers(100000000)
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
ORDER BY created_time DESC
LIMIT 5 OFFSET 0
5 rows in set. Elapsed: 0.013 sec. Processed 835.84 thousand rows
set optimize_read_in_order = 0
SELECT ad, created_time, ab, aa, user_id, ac, ag, af
FROM stats
ORDER BY created_time DESC
LIMIT 5 OFFSET 0
5 rows in set. Elapsed: 0.263 sec. Processed 100.00 million rows
Note the difference:
set optimize_read_in_order = 0 vs. set optimize_read_in_order = 1
I don't understand why optimize_read_in_order is not working in your case.
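One hedged way to double-check on your side is to confirm the session value of the setting; system.settings is a standard ClickHouse system table:
-- shows the current value and whether it was changed for this session
SELECT name, value, changed
FROM system.settings
WHERE name = 'optimize_read_in_order'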
Related
what are "ReverseTransform" and "MergeTreeReverse" mean in the clickhouse explain pipeline?
table
CREATE TABLE device_info
(
`event_time` DateTime,
`create_time` DateTime DEFAULT now(),
`product_id` String,
`platform` String,
`app_version` String,
`sdk_version` String,
`os_version` String,
`model` String,
`device_id` String,
`device_type` String,
`device_cost` Int64,
`data_id` String
)
ENGINE = ReplicatedMergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (event_time, product_id, platform, app_version, sdk_version, os_version, model)
TTL event_time + toIntervalDay(7)
SETTINGS index_granularity = 8192
SELECT
event_time AS eventTime,
launch_cost AS launchCost,
device_id,
data_id
FROM device_info
WHERE (device_id = 'xxxxxxx') AND (product_id = 'xxxxxxx') AND (device_type IN ('type1')) AND (event_time >= 'xxxx') AND (event_time <= 'xxxxx')
ORDER BY event_time DESC
explain pipeline:
What do "ReverseTransform" and "MergeTreeReverse" mean in this explain pipeline output?
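For context, a textual version of the pipeline can be produced by prefixing the query with EXPLAIN PIPELINE; a trimmed sketch against the table above (filter values are placeholders, as in the question):
EXPLAIN PIPELINE
SELECT event_time, device_id, data_id
FROM device_info
WHERE (device_id = 'xxxxxxx') AND (product_id = 'xxxxxxx')
ORDER BY event_time DESC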
In the trace log I find this statement:
"MergingSortedTransform: Merge sorted 645 blocks, 861 rows in 5.927999977 sec."
What does this statement mean?
I'd appreciate your answer, thanks!
I have a table with 3 billion rows. When I run a query like this:
Select * from tsnew where time > 971128806382 and time <971172006000
limit 100
it works fine and takes 0.2 seconds.
But when I add an ORDER BY to the query, like this:
Select * from tsnew where time > 971128806382 and time <971172006000
order by time desc
limit 100
it takes a very long time (more than 20 seconds).
create table tsnew(
ext_rec_num Nullable(UInt64),
xdr_id Nullable(UInt64),
xdr_grp Nullable(UInt64),
xdr_type Nullable(UInt64),
xdr_subtype Nullable(Int16),
xdr_direction Nullable(Int16),
xdr_location Nullable(Int16),
time UInt64,
stop_time UInt64,
transaction_duration Nullable(UInt64),
response_time Nullable(UInt64),
protocol Nullable(Int16),
chunk_count Nullable(Int16),
dpc Nullable(Int32),
opc Nullable(Int32),
first_link_id String,
last_dpc Nullable(Int32),
last_opc Nullable(Int32),
last_link_id String,
first_back_opc Nullable(Int32),
first_back_link_id String,
calling_ssn Nullable(Int16),
called_ssn Nullable(Int16),
called_sccp_address String,
calling_party_address String,
response_calling_address String,
root_end_code Nullable(Int32),
root_cause_code Nullable(Int32),
root_cause_pl Nullable(Int16),
root_failure Nullable(Int16),
root_equip Nullable(Int16)
)
ENGINE = MergeTree()
PARTITION BY toInt64(time/3600000)*3600000
order by time
SETTINGS index_granularity = 8192
Can anyone help me with this?
It's a known issue. Hopefully the fix will be merged soon. Subscribe to the PR and upgrade your CH once the PR is merged.
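Until then, one small hedged check is to confirm which server version is actually running, so you know when the upgrade containing the fix is in place:
-- returns the ClickHouse server version string
SELECT version()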
SQLite DB with a single table and 60,000,000 records. The time to run a simple query is more than 100 seconds.
I've tried switching to PostgreSQL, but its performance was even worse.
I haven't tested it on MySQL or MS SQL.
Should I split the table (say, a different table for each pointID - there are some hundreds of them - or a different table for each month, so I'd have a maximum of 10,000,000 records per table)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The next query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit:
An answer from another place was to add the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
This improved performance by a factor of 3.
Edit #2: at @Raymond Nijland's suggestion, here is the execution plan:
"SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)"
"0" "0" "0" "EXECUTE LIST SUBQUERY 1"
"0" "0" "0" "USE TEMP B-TREE FOR ORDER BY"
Thanks to him - using this data, I changed the query to the following:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me it's solved).
After @RaymondNijland suggested that I check the execution plan, I changed the query to:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This query gives the same results as the other one, but it's 120 times faster (it reduces the number of records before sorting).
I am currently doing some testing and need a large amount of data (around 1 million rows).
I am using the following table:
CREATE TABLE OrderTable(
OrderID INTEGER NOT NULL,
StaffID INTEGER,
TotalOrderValue DECIMAL(8,2),
CustomerID INTEGER);
ALTER TABLE OrderTable ADD CONSTRAINT OrderID_PK PRIMARY KEY (OrderID)
CREATE SEQUENCE seq_OrderTable
MINVALUE 1
START WITH 1
INCREMENT BY 1
CACHE 10000;
and want to randomly insert 1000000 rows into it with the following rules:
OrderID needs to be sequential (1, 2, 3, etc.)
StaffID needs to be a random number between 1 and 1000
CustomerID needs to be a random number between 1 and 10000
TotalOrderValue needs to be a random decimal value between 0.00 and 9999.99
Is this even possible to do? I think I could generate each of these values using an UPDATE statement like the one below, but I am not sure how to generate a million rows in one go.
Thanks for any help on this matter.
This is how I would randomly generate a number in an UPDATE:
UPDATE StaffTable SET DepartmentID = DBMS_RANDOM.value(low => 1, high => 5);
For testing purposes I created the table and populated it in one shot, with this query:
CREATE TABLE OrderTable(OrderID, StaffID, CustomerID, TotalOrderValue)
as (select level, ceil(dbms_random.value(0, 1000)),
ceil(dbms_random.value(0,10000)),
round(dbms_random.value(0,10000),2)
from dual
connect by level <= 1000000)
/
A few notes: it is better to use NUMBER as the data type; NUMBER(8,2) is the format for a decimal. For populating this kind of table, it is much more efficient to use the "hierarchical query without PRIOR" trick (the "connect by level <= ..." trick) to generate the order IDs.
If your table is already created, insert into OrderTable (select level ...) (the same subquery as in my code) should work just as well. You may be better off adding the PK constraint only after you create the data, though, so as not to slow things down.
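A sketch of that variant, assuming the table already exists and the primary key has not been added yet (same random expressions as in the CTAS above):
insert into OrderTable (OrderID, StaffID, CustomerID, TotalOrderValue)
select level,
       ceil(dbms_random.value(0, 1000)),
       ceil(dbms_random.value(0, 10000)),
       round(dbms_random.value(0, 10000), 2)
from dual
connect by level <= 1000000;

-- add the PK only after the bulk load, as suggested above
alter table OrderTable add constraint OrderID_PK primary key (OrderID);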
A small sample from the table created (total time to create the table on my cheap laptop - 1,000,000 rows - was 7.6 seconds):
SQL> select * from OrderTable where orderid between 500020 and 500030;
ORDERID STAFFID CUSTOMERID TOTALORDERVALUE
---------- ---------- ---------- ---------------
500020 666 879 6068.63
500021 189 6444 1323.82
500022 533 2609 1847.21
500023 409 895 207.88
500024 80 2125 1314.13
500025 247 3772 5081.62
500026 922 9523 1160.38
500027 818 5197 5009.02
500028 393 6870 5067.81
500029 358 4063 858.44
500030 316 8134 3479.47
Not sure if the title fits, but here's my problem:
I have the following table:
create table OpenTrades(
AccountNumber number,
SnapshotTime date,
Ticket number,
OpenTime date,
TradeType varchar2(4),
TradeSize number,
TradeItem char(6),
OpenPrice number,
CurrentAsk number,
CurrentBid number,
TradeSL number,
TradeTP number,
TradeSwap number,
TradeProfit number
);
alter table OpenTrades add constraint OpenTrades_PK Primary Key (AccountNumber, SnapshotTime, Ticket) using index tablespace MyNNIdx;
For every (SnapshotTime, account), I want to select min(OpenPrice) and max(OpenPrice) in such a way that the resulting min and max are relative to the past only, with respect to SnapshotTime.
For instance, for any possible (account, tradeitem) pair, I may have 10 records with, say, Snapshottime=10-jun and openprice between 0.9 and 2.0, as well as 10 more records with SnapshotTime=11-jun and openprice between 1.0 and 2.1, as well as 10 more records with SnapshotTime=12-jun and openprice between 0.7 and 1.9.
In such scenario, the sought query should return something like this:
AccountNumber SnapshotTime MyMin MyMax
------------- ------------ ----- -----
1234567 10-jun 0.9 2.0
1234567 11-jun 0.9 2.1
1234567 12-jun 0.7 2.1
I've already tried this, but it only returns min() and max() within the same snapshottime:
select accountnumber, snapshottime, tradeitem, min(openprice), max(openprice)
from opentrades
group by accountnumber, snapshottime, tradeitem
Any help would be appreciated.
You can use the analytic versions of min() and max() for this, along with windowing clauses:
select distinct accountnumber, snapshottime, tradeitem,
min(openprice) over (partition by accountnumber, tradeitem
order by snapshottime, openprice
rows between unbounded preceding and current row) as min_openprice,
max(openprice) over (partition by accountnumber, tradeitem
order by snapshottime, openprice desc
rows between unbounded preceding and current row) as max_openprice
from opentrades
order by accountnumber, snapshottime, tradeitem;
ACCOUNTNUMBER SNAPSHOTTIME TRADEITEM MIN_OPENPRICE MAX_OPENPRICE
------------- ------------ --------- ------------- -------------
1234567 10-JUN-14 X .9 2
1234567 11-JUN-14 X .9 2.1
1234567 12-JUN-14 X .7 2.1
SQL Fiddle.
The partition by calculates the value for the current accountnumber and tradeitem, within the subset of rows based on the rows between clause; the order by means that it only looks at rows in any previous snapshot and up to the lowest (for min) or highest (for max, because of the desc) in the current snapshot, when calculating the appropriate min/max for each row.
The analytic result is calculated for every row. If you run it without the distinct then you see all your base data plus the same min/max for each snapshot (Fiddle). As you don't want any of the varying data you can suppress the duplication with distinct, or by making it a query with a row_number() that you then filter on, etc.
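A sketch of the row_number() variant mentioned above (same window definitions; the filter keeps one row per accountnumber/tradeitem/snapshottime, which works because every row within a snapshot carries the same running min/max):
select accountnumber, snapshottime, tradeitem, min_openprice, max_openprice
from (
  select accountnumber, snapshottime, tradeitem,
         min(openprice) over (partition by accountnumber, tradeitem
           order by snapshottime, openprice
           rows between unbounded preceding and current row) as min_openprice,
         max(openprice) over (partition by accountnumber, tradeitem
           order by snapshottime, openprice desc
           rows between unbounded preceding and current row) as max_openprice,
         -- arbitrary tie-breaker; any single row per snapshot will do
         row_number() over (partition by accountnumber, tradeitem, snapshottime
           order by openprice) as rn
  from opentrades
)
where rn = 1
order by accountnumber, snapshottime, tradeitem;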
Does this answer your problem?
select ot1.accountnumber, ot1.snapshottime, ot1.tradeitem,
min(ot2.openprice), max(ot2.openprice)
from opentrades ot1, opentrades ot2
where ot2.accountnumber = ot1.accountnumber
and ot2.tradeitem = ot1.tradeitem
and ot2.snapshottime <= ot1.snapshottime
group by ot1.accountnumber, ot1.snapshottime, ot1.tradeitem