Get top N rows without ORDER BY operator in ClickHouse

I have a table:
CREATE TABLE StatsFull (
Timestamp Int32,
Uid String,
ErrorCode Int32,
Name String,
Version String,
Date Date MATERIALIZED toDate(Timestamp),
Time DateTime MATERIALIZED toDateTime(Timestamp)
) ENGINE = MergeTree() PARTITION BY toMonday(Date)
ORDER BY Time SETTINGS index_granularity = 8192
I need to get the top 100 Names by unique Uids, or the top 100 ErrorCodes.
The obvious query is:
SELECT Name, uniq(Uid) AS cnt FROM StatsFull
WHERE Time > subtractDays(toDate(now()), 1)
GROUP BY Name ORDER BY cnt DESC LIMIT 100
But the data was too big, so I created an AggregatingMergeTree, since I do not need to filter the data by hour (just by date).
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree() PARTITION BY toMonday(Date)
ORDER BY
(
Date,
ProductName,
ErrorCode,
Name,
Version
) SETTINGS index_granularity = 8192 AS
SELECT
Date,
ProductName,
ErrorCode,
Name,
Version,
uniqState(Uid) AS UniqUsers
FROM
StatsFull
GROUP BY
Date,
ProductName,
ErrorCode,
Name,
Version
And my current query is:
SELECT Name FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
The query worked fine, but eventually the number of rows per day grew and the query became too memory-hungry, so I am looking for an optimization.
I have found the function topK(N)(column), which returns an array of the most frequent values in the specified column, but it isn't what I need.

I would suggest the following points:
where possible, prefer SimpleAggregateFunction over AggregateFunction (sketched below)
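For illustration, a minimal sketch with a hypothetical per-day error counter; note that uniq has no SimpleAggregateFunction form, so this applies to functions such as sum, min, max, or any:
CREATE TABLE ErrorsDaily (
Date Date,
Name String,
Errors SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
ORDER BY (Date, Name);
/* SimpleAggregateFunction stores the aggregated value itself rather than an
   opaque state, so rows are inserted and read back as plain numbers */
INSERT INTO ErrorsDaily VALUES (today(), 'app.exe', 3);
SELECT Name, sum(Errors) AS errors FROM ErrorsDaily GROUP BY Name;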
use uniqCombined/uniqCombined64, which "consumes several times less memory" compared with uniq (see the sketch below)
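Applied to the view above, the switch is just a matter of the state type and the -State/-Merge combinators; a sketch of the three lines that would change:
/* in the view's column list */
UniqUsers AggregateFunction(uniqCombined, String)
/* in the SELECT that feeds the view */
uniqCombinedState(Uid) AS UniqUsers
/* in the final query */
ORDER BY uniqCombinedMerge(UniqUsers) DESC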
reduce the number of dimensions in the aggregated view (it looks like ProductName and Version can be omitted):
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
Name String,
ErrorCode Int32,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
ORDER BY (Date, Name, ErrorCode) AS
SELECT Date, Name, ErrorCode, uniqState(Uid) AS UniqUsers
FROM StatsFull
GROUP BY Date, Name, ErrorCode;
add extra 'heuristic' constraints to the WHERE clause of the resulting query:
SELECT Name, uniqMerge(UniqUsers) uniqUsers
FROM StatsAggregated
WHERE Date > subtractDays(toDate(now()), 1)
AND ErrorCode = 0 /* apply any other conditions to narrow the result set as much as possible */
GROUP BY Name
HAVING uniqUsers > 12345 /* <-- 12345 is a 'heuristic' number that you evaluate based on your data; as an aggregate, it belongs in HAVING, not WHERE */
ORDER BY uniqUsers DESC LIMIT 100
use sampling
/* Raw-table */
CREATE TABLE StatsFull (
/* .. */
) ENGINE = MergeTree()
PARTITION BY toMonday(Date)
SAMPLE BY xxHash32(Uid) /* <-- */
ORDER BY (Time, xxHash32(Uid))
/* Applying sampling to the raw table can speed up short-term queries (periods of several hours, etc.) */
SELECT Name, uniq(Uid) AS cnt
FROM StatsFull
SAMPLE 0.05 /* <-- */
WHERE Time > subtractHours(now(), 6) /* <-- hours-period */
GROUP BY Name
ORDER BY cnt DESC LIMIT 100
/* Aggregated-table */
CREATE MATERIALIZED VIEW StatsAggregated (
Date Date,
ProductName String,
ErrorCode Int32,
Name String,
Version String,
UniqUsers AggregateFunction(uniq, String)
) ENGINE = AggregatingMergeTree()
PARTITION BY toMonday(Date)
SAMPLE BY intHash32(toInt32(Date)) /* <-- not sure this is a good choice */
ORDER BY (intHash32(toInt32(Date)), ProductName, ErrorCode, Name, Version) AS
SELECT /* .. */ FROM StatsFull GROUP BY /* .. */
/* Applying sampling to the aggregated table can speed up long-term queries (periods of several weeks, months, etc.) */
SELECT Name
FROM StatsAggregated
SAMPLE 0.1 /* <-- */
WHERE Date > subtractMonths(toDate(now()), 3) /* <-- months-period */
GROUP BY Name
ORDER BY uniqMerge(UniqUsers) DESC LIMIT 100
use distributed query processing. Splitting the data into several parts (shards) enables distributed processing; an additional increase in processing performance comes from the distributed_group_by_no_merge query setting.
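A sketch of that setup, assuming a hypothetical cluster named my_cluster with data sharded by Name, so every row for a given Name lives on a single shard and skipping the initiator-side merge stays correct:
CREATE TABLE StatsAggregatedDist AS StatsAggregated
ENGINE = Distributed(my_cluster, default, StatsAggregated, xxHash32(Name));
SELECT Name, uniqMerge(UniqUsers) uniqUsers
FROM StatsAggregatedDist
WHERE Date > subtractDays(toDate(now()), 1)
GROUP BY Name
ORDER BY uniqUsers DESC LIMIT 100
SETTINGS distributed_group_by_no_merge = 1 /* each shard aggregates fully; the initiator only concatenates the per-shard results */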

if you need to transpose an array into rows, you can use arrayJoin
SELECT arrayJoin(topK(100)(Count)) AS top100_Count FROM Stats


What do "ReverseTransform" and "MergeTreeReverse" mean in the ClickHouse explain pipeline?
The table:
CREATE TABLE device_info
(
`event_time` DateTime,
`create_time` DateTime DEFAULT now(),
`product_id` String,
`platform` String,
`app_version` String,
`sdk_version` String,
`os_version` String,
`model` String,
`device_id` String,
`device_type` String,
`device_cost` Int64,
`data_id` String
)
ENGINE = ReplicatedMergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (event_time, product_id, platform, app_version, sdk_version, os_version, model)
TTL event_time + toIntervalDay(7)
SETTINGS index_granularity = 8192
SELECT
event_time AS eventTime,
device_cost AS deviceCost,
device_id,
data_id
FROM device_info
WHERE (device_id = 'xxxxxxx') AND (product_id = 'xxxxxxx') AND (device_type IN ('type1')) AND (event_time >= 'xxxx') AND (event_time <= 'xxxxx')
ORDER BY event_time DESC
explain pipeline:
[EXPLAIN PIPELINE output screenshot]
What do "ReverseTransform" and "MergeTreeReverse" mean in the explain pipeline output?
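For reference, a pipeline like the one in the screenshot comes from prefixing the query with EXPLAIN PIPELINE (available in recent ClickHouse versions):
EXPLAIN PIPELINE
SELECT event_time, device_id, data_id
FROM device_info
WHERE device_id = 'xxxxxxx'
ORDER BY event_time DESC;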
In the trace log I found this statement:
"MergingSortedTransform: Merge sorted 645 blocks, 861 rows in 5.927999977 sec."
What does this statement mean?
I'd appreciate your answer, thanks!

Exclude part of the select from the date where clause

I have a select (water readings, previous water reading, other columns) and a WHERE clause based on the water reading date. However, the previous water reading must not consider that WHERE clause; I want to get the previous meter reading regardless of the date range.
I looked at UNION, but the problem is that I would have to use the same clause.
SELECT
WATERREADINGS.name,
WATERREADINGS.date,
LAG( WATERREADINGS.meter_reading,1,NULL) OVER(
PARTITION BY WATERREADINGS.meter_id,WATERREADINGS.register_id
ORDER BY WATERREADINGS.meter_id DESC,WATERREADINGS.register_id
DESC,WATERREADINGS.readingdate ASC,WATERREADINGS.created ASC
) AS prev_water_reading
FROM WATERREADINGS
WHERE waterreadings.waterreadingdate BETWEEN '24-JUN-19' AND
'24-AUG-19' and isactive = 'Y'
The prev_water_reading value must not be restricted by the date BETWEEN '24-JUN-19' AND '24-AUG-19' predicate but the rest of the sql should be.
You can do this by first finding the previous meter readings for all rows and then filtering those results on the date, e.g.:
WITH meter_readings AS (SELECT waterreadings.name,
waterreadings.date dt,
lag(waterreadings.meter_reading, 1, NULL) OVER (PARTITION BY waterreadings.meter_id, waterreadings.register_id
ORDER BY waterreadings.readingdate ASC, waterreadings.created ASC)
AS prev_water_reading
FROM waterreadings
WHERE isactive = 'Y')
-- the meter_readings subquery above gets all rows and finds their previous meter reading.
-- the main query below then applies the date restriction to the rows from the meter_readings subquery.
SELECT name,
date,
prev_water_reading
FROM meter_readings
WHERE dt BETWEEN to_date('24/06/2019', 'dd/mm/yyyy') AND to_date('24/08/2019', 'dd/mm/yyyy');
Perform the LAG in an inner query that is not filtered by dates and then filter by the dates in the outer query:
SELECT name,
"date",
prev_water_reading
FROM (
SELECT name,
"date",
LAG( meter_reading,1,NULL) OVER(
PARTITION BY meter_id, register_id
ORDER BY meter_id DESC, register_id DESC, readingdate ASC, created ASC
) AS prev_water_reading,
waterreadingdate -- exposed so the outer query can filter on it
FROM WATERREADINGS
WHERE isactive = 'Y'
)
WHERE waterreadingdate BETWEEN DATE '2019-06-24' AND DATE '2019-08-24'
You should also not use strings for dates (these require an implicit cast using the NLS_DATE_FORMAT session parameter, which any user can change in their own session); use date literals such as DATE '2019-06-24' or an explicit cast such as TO_DATE('24-JUN-19', 'DD-MON-RR').
You also do not need to prefix every column with the table name when there is only a single table, as this clutters up your code and makes it difficult to read. Also, DATE is a keyword, so you either need to wrap it in double quotes to use it as a column name (which makes the column name case-sensitive) or should use a different name for your column.
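A small sketch of both points together, using a hypothetical reading_date column in place of the keyword-clashing date:
SELECT name,
       reading_date -- renamed from "date": no double quotes needed
FROM   waterreadings
WHERE  reading_date BETWEEN DATE '2019-06-24' -- ANSI date literal
                        AND TO_DATE('24/08/2019', 'dd/mm/yyyy'); -- explicit cast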
I've added a subquery that computes the previous reading without the filter, and then joined it to the main table, which has the filters:
SELECT
WATERREADINGS.name,
WATERREADINGS.date,
w_lag.prev_water_reading
FROM
WATERREADINGS,
(SELECT name, date, LAG( WATERREADINGS.meter_reading,1,NULL) OVER(
PARTITION BY WATERREADINGS.meter_id,WATERREADINGS.register_id
ORDER BY WATERREADINGS.meter_id DESC,WATERREADINGS.register_id
DESC,WATERREADINGS.readingdate ASC,WATERREADINGS.created ASC
) AS prev_water_reading
FROM WATERREADINGS) w_lag
WHERE waterreadings.waterreadingdate BETWEEN '24-JUN-19' AND '24-AUG-19' and isactive = 'Y'
and WATERREADINGS.name = w_lag.name
and WATERREADINGS.date = w_lag.date

How can I improve performance of SQLite with a large table?

I have a SQLite DB with a single table and 60,000,000 records; the time to run a simple query is more than 100 seconds.
I tried switching to PostgreSQL, but its performance was even worse.
I haven't tested it on MySQL or MSSQL.
Should I split the table (say, a separate table for each pointID, of which there are several hundred, or a separate table for each month, which would cap each table at about 10,000,000 records)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The next query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit: an answer from another place suggested adding the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
This improved performance by a factor of 3.
Edit #2: following @RaymondNijland's suggestion, the execution plan is:
"0" "0" "0" "SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)"
"0" "0" "0" "EXECUTE LIST SUBQUERY 1"
"0" "0" "0" "USE TEMP B-TREE FOR ORDER BY"
Thanks to him, using this data, I changed the order of the clauses in the query to the following:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me, it's solved).
After @RaymondNijland suggested checking the execution plan, I changed the query to the one shown above. It gives the same results as the original, but is 120 times faster (it reduces the number of records before sorting).

Aggregate date ranges with gaps in Oracle

I need to aggregate date ranges, allowing for gaps of at most 2 days in between, for each id. Any help would be much appreciated.
create table tt ( id int, startdate date, stopdate date);
Insert into TT values (1, to_date('24/05/2010','dd/mm/yyyy'), to_date('29/05/2010','dd/mm/yyyy'));
Insert into TT values (1, to_date('30/05/2010','dd/mm/yyyy'), to_date('22/06/2010','dd/mm/yyyy'));
Insert into TT values (10, to_date('26/06/2012','dd/mm/yyyy'), to_date('28/06/2012','dd/mm/yyyy'));
Insert into TT values (10, to_date('29/06/2012','dd/mm/yyyy'), to_date('30/06/2012','dd/mm/yyyy'));
Insert into TT values (10, to_date('01/07/2012','dd/mm/yyyy'), to_date('30/07/2012','dd/mm/yyyy'));
Insert into TT values (10, to_date('03/08/2012','dd/mm/yyyy'), to_date('30/12/2012','dd/mm/yyyy'));
insert into TT values (90, to_date('08/03/2002','dd/mm/yyyy'), to_date('16/03/2002','dd/mm/yyyy'));
insert into TT values (90, to_date('31/01/2002','dd/mm/yyyy'), to_date('15/02/2002','dd/mm/yyyy'));
insert into TT values (90, to_date('15/02/2002','dd/mm/yyyy'), to_date('28/02/2002','dd/mm/yyyy'));
insert into TT values (90, to_date('31/01/2002','dd/mm/yyyy'), to_date('15/02/2004','dd/mm/yyyy'));
insert into TT values (90, to_date('15/02/2004','dd/mm/yyyy'), to_date('15/04/2004','dd/mm/yyyy'));
insert into TT values (90, to_date('01/03/2002','dd/mm/yyyy'), to_date('07/03/2002','dd/mm/yyyy'));
expected output would be:
1 24/05/2010 22/06/2010
10 26/06/2012 30/07/2012
10 03/08/2012 30/12/2012
90 31/01/2002 15/04/2004
If you're on 12c, you can use one of my favourite SQL features: pattern matching (match_recognize).
With this you need to define a pattern variable. This is where you check that the start date of the current row is within two days of the stop date of the previous row, which is:
startdate <= prev ( stopdate ) + 2
The pattern you're searching for is any row, followed by zero or more rows that meet this criterion.
So you have an "always true" strt variable, followed by * (regular expression zero-or-more quantifier) occurrences of the within2 variable:
( strt within2* )
I'm guessing you also need to split the ranges up by ID. So I've added a partition by for this.
Put it all together and you get:
select *
from tt match_recognize (
partition by id
order by startdate, stopdate
measures
first ( startdate ) startdate,
last ( stopdate ) stopdate
pattern ( strt within2* )
define
within2 as startdate <= prev ( stopdate ) + 2
);
ID STARTDATE STOPDATE
1 24/05/2010 22/06/2010
10 26/06/2012 30/07/2012
10 03/08/2012 30/12/2012
If you want to know more about this, you can find several match_recognize examples here.

Unable to get only first occurrence of each job

I am trying to query some jobs from a repo; however, I only need the job with the latest start time. I have tried using ROW_NUMBER for this and selecting only row number 1 for each job, but it doesn't seem to work:
SELECT a.jobname||','||a.projectname||','||a.startdate||','||a.enddate||','||
ROW_NUMBER() OVER ( PARTITION BY a.jobname ORDER BY a.startdate DESC ) AS "rowID"
FROM taskhistory a
WHERE a.jobname IS NOT NULL AND a.startdate >= (SYSDATE-1))LIMIT 1 AND rowID = 1;
ERROR at line 7:
ORA-00932: inconsistent datatypes: expected ROWID got NUMBER
Can I please ask for some assistance?
You have aliased your concatenated string "rowID", which is a mistake because it clashes with the Oracle keyword rowid, a special datatype which allows us to identify table rows by their physical location.
When you referenced the column alias you omitted the double quotes. Oracle therefore interprets it as the keyword rowid, and expects an expression which can be converted to the ROWID datatype.
Double-quoted identifiers are always a bad idea. Avoid them unless truly necessary.
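A tiny illustration of that trap, using a hypothetical table t:
CREATE TABLE t ("rowID" NUMBER);
SELECT rowid FROM t;    -- unquoted: resolves to the ROWID pseudocolumn, not your column
SELECT "rowID" FROM t;  -- quoted, exact case: resolves to the column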
Fixing the column alias will reveal the logic bug in your code. You are concatenating a whole slew of columns together, including the ROW_NUMBER() function, and calling that string "rowID". Clearly that string is never going to equal one, so this will filter out all rows:
and "rowID" = 1
Also LIMIT is not valid in Oracle.
What you need to do is use a sub-query, like this
SELECT a.jobname||','
||a.projectname||','
||a.startdate||','
||a.enddate||','
||to_char(a.rn) as "rowID"
FROM (
SELECT jobname
, projectname
, startdate
, enddate
, ROW_NUMBER() OVER ( PARTITION BY jobname
ORDER BY startdate DESC ) AS RN
FROM taskhistory
WHERE jobname IS NOT NULL
AND startdate >= (SYSDATE-1)
) a
where a.RN = 1;
Concatenating the projection like that seems an odd thing to do but I don't understand your business requirements.
