How to improve the query speed of ClickHouse

In ClickHouse, I want to run a query. The query contains GROUP BY QJTD1, but QJTD1 is obtained by a dictionary lookup. The statement is as follows:
SELECT
    IF(
        sale_mode = 'owner',
        dictGetString('dict.dict_sku', 'dept_id_1', toUInt64OrZero(sku_id)),
        dictGetString('dict.dict_shop', 'dept_id_1', toUInt64OrZero(shop_id))
    ) AS QJTD1,
    brand_cd,
    coalesce(uniq(sd_deal_ord_user_num), 0) AS sd_deal_ord_user_num,
    0 AS item_uv,
    dt
FROM app.test_all
WHERE dt >= '2020-11-01'
  AND dt <= '2020-11-30'
  AND IF(
        sale_mode = 'owner',
        dictGetString('dict.dict_sku', 'bu_id', toUInt64OrZero(sku_id)),
        dictGetString('dict.dict_shop', 'bu_id', toUInt64OrZero(shop_id))
      ) = '1727'
GROUP BY QJTD1, brand_cd, dt
ORDER BY item_pv DESC
LIMIT 0, 100
QJTD1 has serious data skew, resulting in slow query speed. I have tried optimizing the index (sku_id, shop_id, ...) to improve the query speed, but it has no effect. How can I improve the query efficiency?

ClickHouse always evaluates both branches of IF (then and else).
You can use a two-stage GROUP BY: aggregate by the raw key columns first, then apply the dictionary lookups to the much smaller intermediate result:

SELECT
    IF(sale_mode = 'owner',
       dictGetString('dict.dict_sku', 'dept_id_1', toUInt64OrZero(sku_id)),
       dictGetString('dict.dict_shop', 'dept_id_1', toUInt64OrZero(shop_id))
    ) AS QJTD1,
    ...
FROM (
    SELECT sale_mode, sku_id, shop_id, ...
    FROM app.test_all
    GROUP BY sale_mode, sku_id, shop_id, ...
)
GROUP BY QJTD1, ...
Or, if the id -> attribute mappings really are injective, define the dictionary attributes with <injective>true:
https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-structure/

    Flag that shows whether the id -> attribute image is injective.
    If true, ClickHouse can automatically move requests to injective
    dictionaries after the GROUP BY clause. Usually this significantly
    reduces the number of such requests.
    Default value: false.
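With the DDL syntax, the same flag is the INJECTIVE modifier on the attribute. A minimal sketch only: the source table dim_sku is a hypothetical name, and you would adapt SOURCE, LIFETIME and LAYOUT to your environment:

CREATE DICTIONARY dict.dict_sku
(
    sku_id UInt64,
    dept_id_1 String INJECTIVE,  -- each sku_id maps to exactly one dept_id_1
    bu_id String INJECTIVE
)
PRIMARY KEY sku_id
SOURCE(CLICKHOUSE(DB 'dict' TABLE 'dim_sku'))  -- hypothetical source table
LIFETIME(MIN 300 MAX 600)
LAYOUT(HASHED());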
And I would test a UNION ALL rewrite, to evaluate each IF branch only once.
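A sketch of that rewrite, assuming sale_mode only distinguishes the two cases (columns and filters copied from the question):

SELECT
    dictGetString('dict.dict_sku', 'dept_id_1', toUInt64OrZero(sku_id)) AS QJTD1,
    brand_cd,
    coalesce(uniq(sd_deal_ord_user_num), 0) AS sd_deal_ord_user_num,
    0 AS item_uv,
    dt
FROM app.test_all
WHERE dt >= '2020-11-01' AND dt <= '2020-11-30'
  AND sale_mode = 'owner'
  AND dictGetString('dict.dict_sku', 'bu_id', toUInt64OrZero(sku_id)) = '1727'
GROUP BY QJTD1, brand_cd, dt

UNION ALL

SELECT
    dictGetString('dict.dict_shop', 'dept_id_1', toUInt64OrZero(shop_id)) AS QJTD1,
    brand_cd,
    coalesce(uniq(sd_deal_ord_user_num), 0) AS sd_deal_ord_user_num,
    0 AS item_uv,
    dt
FROM app.test_all
WHERE dt >= '2020-11-01' AND dt <= '2020-11-30'
  AND sale_mode <> 'owner'
  AND dictGetString('dict.dict_shop', 'bu_id', toUInt64OrZero(shop_id)) = '1727'
GROUP BY QJTD1, brand_cd, dt

Note that if the same (QJTD1, brand_cd, dt) group can come out of both branches, the two partial uniq counts cannot simply be added; an exact merge would need uniqState in the branches and uniqMerge in an outer GROUP BY.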

Related

How can I improve performance of SQLite with a large table?

SQLite DB with a single table and 60,000,000 records; the time to run a simple query is more than 100 seconds.
I've tried switching to PostgreSQL, but its performance was even worse. I haven't tested it on MySQL or MS SQL.
Should I split the table (say, a different table for each pointID, of which there are a few hundred, or a different table for each month, for a maximum of 10,000,000 records each)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The following query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit:
An answer from another place suggested adding the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
This improved performance by a factor of 3.
Edit #2: at @Raymond Nijland's suggestion, here is the execution plan:
SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)
"0" "0" "0" "EXECUTE LIST SUBQUERY 1"
"0" "0" "0" "USE TEMP B-TREE FOR ORDER BY"
Thanks to him: using this data, I changed the query to the following:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me, this solved it).
After @Raymond Nijland suggested checking the execution plan, I changed the query to the subquery form shown above. It gives the same results as the original, but is 120 times faster (it decreases the number of records before sorting).

How do you filter a measure in MDX and output only a single row for the measure?

Given the MDX:
select {[Measures].[Effort], [Measures].[Count]} on columns from [Tickets]
How can zero (0) values for [Measures].[Effort] be filtered out of [Measures].[Count], so that the resulting [Measures].[Count] value is reduced by the number of "Tickets" with zero (0) effort?
One would think it would be easy to filter out values; however, that's not the case. The following does not reduce the count, of course, because the final single output value is naturally greater than zero (0):
select {[Measures].[Effort], FILTER([Measures].[Count], [Measures].[Effort] > 0 )} on 0
from [Tickets]
Also, please assume millions of tickets, so placing a ticket ID on axis 1, then filtering, then summing after the MDX result is returned, would not be performant.
If performance is a concern and the following query is too slow:
With
Member [Measures].[RealCount] as
SUM(
IIF(
[Measures].[Effort] > 0,
[Measures].[Count],
Null
)
)
Select
{[Measures].[Effort],[Measures].[Count],[Measures].[RealCount]} on 0
From [Tickets]
you have to filter it out in the DWH and pre-calculate a real count.
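One way to do that, as a sketch only (FactTicket, Effort and HasEffort are assumed names, and this uses SQL Server computed-column syntax, assuming the cube is backed by SQL Server):

-- hypothetical fact table change: persist a 0/1 flag per ticket row
ALTER TABLE FactTicket
    ADD HasEffort AS (CASE WHEN Effort > 0 THEN 1 ELSE 0 END);
-- then expose SUM(HasEffort) as a regular cube measure, e.g. RealCount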
I'm not sure of your ticket hierarchy structure, so I'll guess that bit, but I would imagine something along these lines:
WITH MEMBER [Measures].[RealCount] AS
SUM(
[Ticket].[Ticket].[Ticket Id],
Iif(
[Measures].[Effort] > 0
,1
,NULL
)
)
SELECT
{
[Measures].[Effort]
,[Measures].[Count]
,[Measures].[RealCount]
} on 0
FROM [Tickets];
If the above gives the correct result then it can be further improved by moving some of the logic to the cube script - this bit:
CREATE HIDDEN SumTicker;
[Measures].[SumTicker] = Iif([Measures].[Effort] > 0,1,NULL);
NON_EMPTY_BEHAVIOR([Measures].[SumTicker]) = [Measures].[Effort];
Then the script becomes:
WITH MEMBER [Measures].[RealCount] AS
SUM(
[Ticket].[Ticket].[Ticket Id],
[Measures].[SumTicker]
)
SELECT
{
[Measures].[Effort]
,[Measures].[Count]
,[Measures].[RealCount]
} on 0
FROM [Tickets];

Passing a parameter to a WITH clause query in Oracle

I'm wondering if it's possible to pass one or more parameters to a WITH clause query; in a very simple way, doing something like this (which, obviously, does not work!):
with qq(a) as (
select a+1 as increment
from dual
)
select qq.increment
from qq(10); -- should get 11
Of course, the use I'm going to make of it is much more complicated, since the WITH clause would be in a subquery, and the parameters I'd pass are values taken from the main query... details upon request... ;-)
Thanks for any hint
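For the simple example above, one workaround (a sketch, not from the original thread) is to put the parameter in its own CTE and join it, since a CTE cannot be invoked like a function; the WITH column-alias syntax needs Oracle 11.2 or later:

with params(p) as (
    select 10 from dual
),
qq(a) as (
    select p + 1 from params
)
select a from qq;  -- returns 11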
OK.....here's the whole deal:
select appu.* from
(<quite a complex query here>) appu
where not exists
(select 1
from dual
where appu.ORA_APP IN
(select slot from
(select distinct slots.inizio,slots.fine from
(
with
params as (select 1900 fine from dual)
--params as (select app.ora_fine_attivita fine
-- where app.cod_agenda = appu.AGE
-- and app.ora_fine_attivita = appu.fine_fascia
--and app.data_appuntamento = appu.dataapp
--)
,
Intervals (inizio, EDM) as
( select 1700, 20 from dual
union all
select inizio+EDM, EDM from Intervals join params on
(inizio <= fine)
)
select * from Intervals join params on (inizio <= fine)
) slots
) slots
where slots.slot <= slots.fine
)
order by 1,2,3;
Without going into too much detail, the WHERE condition should remove those records where appu.ORA_APP matches one of the records that are supposed to be created in the (outer) 'slots' table.
The constants used in the example are good for a subset of records (a single appu.AGE value); that's why I need to parametrize it, in order to use the commented-out 'params' table (replicated, then, in the 'Intervals' table).
I know that's not simple to analyze from scratch, but I tried to make it as clear as possible; feel free to ask for a numeric example if needed.
Thanks

Oracle tuning for a query with nested queries

I am trying to improve a query. I have a dataset of opened tickets. Every ticket has several rows, and every row represents an update of the ticket. There is a field (dt_update) that differs on every row.
I have these indexes on st_remedy_full_light:
IDX_ASSIGNMENT (ASSIGNMENT)
IDX_REMEDY_INC_ID (REMEDY_INC_ID)
IDX_REMDULL_LIGHT_DTUPD (DT_UPDATE)
Now, the query runs in 8 seconds. That is too slow for me.
WITH last_ticket AS
( SELECT *
FROM st_remedy_full_light a
WHERE a.dt_update IN
( SELECT MAX(dt_update)
FROM st_remedy_full_light
WHERE remedy_inc_id = a.remedy_inc_id
)
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
This is the plan (execution plan image not reproduced here).
How could I improve this query?
P.S. This is just a part of a big query
Additional information:
- The table st_remedy_full_light contains 529,507 rows.
You could try:
WITH last_ticket AS
( SELECT remedy_inc_id, ASSIGNMENT,
rank() over (partition by remedy_inc_id order by dt_update desc) rn
FROM st_remedy_full_light a
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
where rn = 1;
The best alternative query, which is also much easier to execute, is this:
select remedy_inc_id
, max(assignment) keep (dense_rank last order by dt_update)
from st_remedy_full_light
group by remedy_inc_id
This will use only one full table scan and a (hash/sort) GROUP BY, with no self joins.
Don't bother with indexed access, as you'll probably find a full table scan is most appropriate here, unless the table is really wide and a composite index on all used columns (remedy_inc_id, dt_update, assignment) would be significantly quicker to read than the table.
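If you do want to test that, the composite index would look like this (a sketch; the index name is invented):

create index idx_remedy_last_upd
    on st_remedy_full_light (remedy_inc_id, dt_update, assignment);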

Oracle query fine tuning

Hi, I have a database table with a large number of records, roughly 400K, which is expected to grow even more.
I have a query to fetch data from this table to display records to the user. My query is below.
SELECT "PC0".PYID AS "pyID" ,
"PC0".NAME AS "Name" ,
"PC0".OPPORTUNITYSTAGE AS "OpportunityStage" ,
"PC0".PXCREATEOPNAME AS "pxCreateOpName" ,
"PC0".PZINSKEY AS "pzInsKey" ,
"PC0".OPPORTUNITYSHORTNAME AS "OpportunityShortName" ,
"PC0".IDTYPE AS "IDType" ,
"PC0".IDNO AS "IDNo" ,
"Campaign".PROGRAMNAME AS "ProgramName" ,
"Campaign".ENDDATE AS "EndDate" ,
"PC0".PRODUCTNAME AS "ProductName" ,
"PC0".PRODUCTTYPE AS "ProductType" ,
"PC0".OPPORTUNITYSTAGE AS "OpportunityStage" ,
"PC0".PXCREATEOPNAME AS "pxCreateOpName" ,
"PC0".OPPORTUNITYSOURCE AS "OpportunitySource" ,
"PC0".OPPORTUNITYOWNER AS "OpportunityOwner" ,
"PC0".IDTYPE
||"PC0".IDNO AS "pyTextValue(1)" ,
"PC0".REMINDERDATE AS "ReminderDate" ,
"PC0".STAGELASTCHANGED AS "StageLastChanged" ,
ROUND((CAST(SYSDATE AS DATE) - CAST("PC0".STAGELASTCHANGED AS DATE))) AS "pyIntegerValue(1)" ,
(
CASE
WHEN ROUND((CAST(SYSDATE AS DATE) - CAST("PC0".REMINDERDATE AS DATE))) > 0
THEN 1
WHEN ROUND((CAST(SYSDATE AS DATE) - CAST("PC0".STAGELASTCHANGED AS DATE))) > 7
THEN 2
ELSE 3
END) AS "pyIntegerValue(2)" ,
"PC0".PXCREATEDATETIME AS "pxCreateDateTime" ,
"PC0".CAMPAIGNID AS "CampaignID" ,
ROUND((CAST(SYSDATE AS DATE) - CAST("PC0".REMINDERDATE AS DATE))) AS "pyIntegerValue(3)"
FROM MYCO_OPPORTUNITY "PC0"
LEFT OUTER JOIN MYCO_CAMPAIGN "Campaign"
ON ( "PC0".CAMPAIGNID = "Campaign".PYID)
ORDER BY 21 ASC,
22 DESC
This takes close to 13 seconds to fetch the first 50 records in SQL Developer. In real use I will be fetching almost 5k records at a time.
The 13 seconds is what I get after defining a function-based index for the CAST on the REMINDERDATE and STAGELASTCHANGED columns, and a bitmap join index.
Can you please suggest how I should optimize the query? ORDER BY on a large set might be the issue, but it is a must for me. :(
Make sure you have an index on "PC0".CAMPAIGNID and on "Campaign".PYID.
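For example (a sketch only; the index names are invented):

create index idx_opp_campaignid on MYCO_OPPORTUNITY (CAMPAIGNID);
create index idx_campaign_pyid on MYCO_CAMPAIGN (PYID);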
Also make sure your SGA is set high enough; without knowing a lot about the server and database, it's hard to provide more guidance than that.
You're using ORDER BY on a computed column, which means Oracle has to compute this value for all 400k rows before it can sort and return results. To confirm that this is the problem, test without the ORDER BY.
There are a number of possible solutions, but this example does not seem to be your actual use case, so it's pretty much meaningless to suggest optimizations for it.
Without more knowledge of the data, I'd suggest splitting the query into three parts connected with UNION ALL, and implementing indexes on reminderdate and stagelastchanged:
select * from ( [part 1] where reminderdate > sysdate order by pxCreateDateTime )
union all
select * from ( [part 2] where reminderdate <= sysdate and stagelastchanged + 7 < sysdate order by pxCreateDateTime )
union all
select * from ( [part 3] where reminderdate <= sysdate and stagelastchanged + 7 >= sysdate order by pxCreateDateTime )
I'd then expect that parts 1 and 2 can be satisfied using an index, and part 3 by a full table scan, which might be helped by adding a FIRST_ROWS hint.
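The hint itself looks like this (a minimal sketch against a simplified version of part 3; the row count 50 is an arbitrary choice):

select /*+ FIRST_ROWS(50) */ *
from MYCO_OPPORTUNITY
where REMINDERDATE <= sysdate
order by PXCREATEDATETIME;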
