SQLite DB with a single table and 60,000,000 records. The time to run a simple query is more than 100 seconds.
I've tried switching to PostgreSQL, but its performance was even worse.
I haven't tested it on MySQL or MS SQL.
Should I split the table? Say, a different table for each pointID (there are a few hundred of them), or a different table for each month (then each table would hold at most 10,000,000 records)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The following query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit:
An answer from another place suggested adding the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
This improved performance by a factor of 3.
Edit #2: at #RaymondNijland's suggestion, here is the execution plan:
"0" "0" "0" "SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)"
"0" "0" "0" "EXECUTE LIST SUBQUERY 1"
"0" "0" "0" "USE TEMP B-TREE FOR ORDER BY"
Thanks to him, and using this information, I changed the order of the conditions in the query to the following:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me, the problem is solved).
After #RaymondNijland suggested checking the execution plan, I changed the query to the one shown in Edit #2 above. It gives the same results as the original, but it is 120 times faster, because it reduces the number of records before sorting. (Note that sorting by id desc inside and by timestamp desc outside is only equivalent when id order matches timeStamp order, which holds for this data.)
In ClickHouse, I want to run a query that contains GROUP BY QJTD1, where QJTD1 is obtained by querying a dictionary. The statement is as follows:
SELECT
IF(
sale_mode = 'owner',
dictGetString(
'dict.dict_sku',
'dept_id_1',
toUInt64OrZero(sku_id)
),
dictGetString(
'dict.dict_shop',
'dept_id_1',
toUInt64OrZero(shop_id)
)
) AS QJTD1,
brand_cd,
coalesce(
uniq(sd_deal_ord_user_num),
0
) AS sd_deal_ord_user_num,
0 AS item_uv,
dt
FROM app.test_all
WHERE dt >= '2020-11-01'
AND dt <= '2020-11-30'
and IF(
sale_mode = 'owner',
dictGetString(
'dict.dict_sku',
'bu_id',
toUInt64OrZero(sku_id)
),
dictGetString(
'dict.dict_shop',
'bu_id',
toUInt64OrZero(shop_id)
)
) = '1727'
GROUP BY
QJTD1,
brand_cd,
dt
ORDER BY item_pv desc limit 0, 100
QJTD1 has serious data skew, resulting in slow query speed. I have tried to optimize the index (sku_id, shop_id, ...) to improve the query speed, but it has no effect. How can I improve the query efficiency?
ClickHouse always evaluates both branches of an IF (both then and else).
You can use a two-stage GROUP BY:
select IF(sale_mode = 'owner', ...) as QJTD1
from (
select owner, sku_id, dept_id_1, ....
...
group by owner, sku_id, dept_id_1
)
group by QJTD1
Or define the dictionary with <injective>true:
https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-structure/
Flag that shows whether the id -> attribute image is injective.
If true, ClickHouse can automatically place after the GROUP BY
clause the requests to dictionaries with injection. Usually it
significantly reduces the amount of such requests.
Default value: false.
This applies only if the mappings really are injective.
I would also test UNION ALL, to compute each IF branch only once.
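A minimal sketch of that UNION ALL idea, assuming the table and dictionaries are exactly as in the question (only the columns shown above are carried through):
-- Each branch filters on sale_mode first, so each dictGetString call
-- runs only against the rows that actually need it.
SELECT QJTD1,
       brand_cd,
       uniq(sd_deal_ord_user_num) AS sd_deal_ord_user_num,
       0 AS item_uv,
       dt
FROM
(
    SELECT dictGetString('dict.dict_sku', 'dept_id_1', toUInt64OrZero(sku_id)) AS QJTD1,
           brand_cd, sd_deal_ord_user_num, dt
    FROM app.test_all
    WHERE sale_mode = 'owner'
      AND dt >= '2020-11-01' AND dt <= '2020-11-30'
      AND dictGetString('dict.dict_sku', 'bu_id', toUInt64OrZero(sku_id)) = '1727'
    UNION ALL
    SELECT dictGetString('dict.dict_shop', 'dept_id_1', toUInt64OrZero(shop_id)) AS QJTD1,
           brand_cd, sd_deal_ord_user_num, dt
    FROM app.test_all
    WHERE sale_mode != 'owner'
      AND dt >= '2020-11-01' AND dt <= '2020-11-30'
      AND dictGetString('dict.dict_shop', 'bu_id', toUInt64OrZero(shop_id)) = '1727'
)
GROUP BY QJTD1, brand_cd, dt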
I have a situation where I have to find the records that take more than 24 hours to load into the DW.
For this I have two tables:
Table 1: contains the stats about each and every load.
Table 2: contains the stats about when we received each file to load.
Now I want only those records which took more than 24 hours to load.
The date on which a file was received is in table 2, whereas the date its load finished is in table 1, so table 2 may have more than one entry for each file.
I have developed the query below, but it's taking too much time:
SELECT
rcd.file_date,
rcd.recived_on as "Date received On",
rcd.loaded_On "Date Processed On",
to_char(rcd.recived_on,'DY') as "Day",
round((rcd.loaded_On - rcd.recived_on)*24,2) as "time required"
FROM (
SELECT
tbl1.file_date,
(SELECT tbl2.recived_on
FROM ( SELECT recived_on
FROM table2
Where fileName = tbl1.feedName
order by recived_on) tbl2
WHERE rownum = 1) recived_on,
tbl1.loaded_On,
to_char(tbl2.recived_on,'DY'),
round((tbl1.loaded_On - tbl2.recived_on)*24,2)
FROM Table1 tbl1 ,
Table1 tbl2
WHERE
tbl1.id=tbl2.id
AND tbl1.FileState = 'Success'
AND trunc(loaded_On) between '25-Feb-2020' AND '03-Mar-2020'
) rcd
WHERE (rcd.loaded_On - rcd.recived_on)*24 > 24;
I think a lot of your problem most likely stems from the scalar subquery in the column list of your inner query. Try using an analytic function instead. Something like this:
SELECT rcd.file_date,
rcd.recived_on AS "Date received On",
rcd.loaded_On "Date Processed On",
to_char(rcd.recived_on, 'DY') AS "Day",
round((rcd.loaded_On - rcd.recived_on) * 24, 2) AS "time required"
FROM (SELECT tbl1.file_date,
MIN(tbl2.recived_on) OVER (PARTITION BY tbl2.filename) AS recived_on,
tbl1.loaded_On
FROM Table1 tbl1
INNER JOIN Table1 tbl2 ON tbl1.id = tbl2.id
WHERE tbl1.FileState = 'Success'
AND trunc(loaded_On) BETWEEN '25-Feb-2020' AND '03-Mar-2020') rcd
WHERE (rcd.loaded_On - rcd.recived_on) * 24 > 24;
Also, you were selecting some columns in the inner query and not using them, so I removed them.
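One more hedged note (mine, not from the original answer): the string literals in the BETWEEN rely on implicit conversion via the session's NLS_DATE_FORMAT; an explicit TO_DATE makes the filter deterministic:
-- Explicit conversion avoids depending on NLS_DATE_FORMAT session settings.
AND trunc(loaded_On) BETWEEN TO_DATE('25-Feb-2020', 'DD-Mon-YYYY')
                         AND TO_DATE('03-Mar-2020', 'DD-Mon-YYYY')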
I have a couple of tables: one that records from which date a "normal subscription" (A) is valid, and one for a "trial subscription" (B) that has a cancellation date.
TABLE A
ValidFrom id
2022-09-24 1
2022-01-25 2
TABLE B
id cancellationDate
1 2023-07-16
2 2023-07-16
1 2023-06-05
2 2019-07-04
1 2016-10-01
1 2023-12-16
1 2017-10-28
To figure out whether there is a subscription after the trial period, I created the following query:
SELECT
Sales.B.CancellationDate
,Y.ValidFrom
FROM Sales.A AS Y
INNER JOIN
Sales.B ON Y.id = Sales.B.id
WHERE
Sales.B.CancellationDate > Y.ValidFrom
ORDER BY 1
which will return the following:
CancellationDate ValidFrom
2023-06-05 2022-09-24
2023-07-16 2022-09-24
2023-07-16 2022-01-25
2023-12-16 2022-09-24
I was wondering if there is a better way to improve the performance of the following (consider that the indexes are there):
Sales.B.CancellationDate > Y.ValidFrom
I have the feeling that the DB engine is doing N x M comparisons in order to evaluate the predicate and produce the result.
EDIT:
Indexes/Table:
CREATE CLUSTERED INDEX [ClusteredIndex-20181107-194645] ON [Sales].[B]( [id] ASC)
CREATE CLUSTERED INDEX [ClusteredIndex-20181107-194645] ON [Sales].[A]( [id] ASC)
CREATE TABLE [Sales].[A](
[ValidFrom] [date] NULL,
[id] [tinyint] NULL
) ON [PRIMARY]
CREATE TABLE [Sales].[B](
[id] [tinyint] NULL,
[cancellationDate] [date] NULL
) ON [PRIMARY]
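One hedged suggestion: with the join on id and the range filter on cancellationDate, an index keyed on both columns lets the engine seek to each id and range-scan the dates instead of comparing every (A, B) pair. A sketch (the index name is illustrative):
-- Hypothetical supporting index: join column first, range-filtered column second.
CREATE NONCLUSTERED INDEX IX_B_id_cancellationDate
    ON [Sales].[B] ([id], [cancellationDate]);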
Consider a table of job run history with the following schema:
job_runs
(
run_id integer not null, -- identifier of the run
job_id integer not null, -- identifier of the job
run_number integer not null, -- job run number, run numbers increment for each job
status text not null, -- status of the run (running, completed, killed, ...)
primary key (run_id)
-- ...
)
It is required to get the last 10 runs with status != 'running' for each job (jobs differ by job_id). To do that, I wrote the following query:
SELECT *
FROM job_runs AS JR1
WHERE JR1.run_number IN
(
    SELECT JR2.run_number
    FROM job_runs AS JR2
    WHERE JR2.job_id = JR1.job_id
      AND JR2.status != 'running'
    ORDER BY JR2.run_number DESC
    LIMIT 10
)
It does what I need, but even though there is a multi-column index on the job_id and run_number fields of the job_runs table, the query is slow: it scans the job_runs table and runs the subquery for each of its rows. The index lets each subquery run fast, but the fact that the outer query scans the entire table kills performance. How can I tune the performance of this query?
Some thoughts:
The number of jobs (distinct job_ids) is small. If there were a FOR loop in SQLite, it would be easy to loop over all distinct job_ids, run the subquery passing each job id instead of JR1.job_id, and then UNION ALL the results.
Important: please don't suggest running the loop inside the source code of my application. I need a pure SQL solution.
You could increase the performance of the subquery further by creating a covering index for it:
CREATE INDEX xxx ON job_runs(job_id, run_number, status);
But the biggest performance problem is that the subquery is executed for each row, although you need to run it only for each unique job ID.
So, first, get just the unique job IDs:
SELECT DISTINCT job_id
FROM job_runs
Then, for each of these IDs, determine the tenth largest run number:
SELECT job_id,
(SELECT run_number
FROM job_runs
WHERE job_id = job_ids.job_id
AND status != 'running'
ORDER BY run_number DESC
LIMIT 1 OFFSET 9
) AS first_run_number
FROM (SELECT DISTINCT job_id
FROM job_runs) AS job_ids
But if there are fewer than ten run numbers for a job, the subquery returns NULL, so let's replace that with a small number so that the comparison below (run_number >= first_run_number) works:
SELECT job_id,
IFNULL((SELECT run_number
FROM job_runs
WHERE job_id = job_ids.job_id
AND status != 'running'
ORDER BY run_number DESC
LIMIT 1 OFFSET 9
), -1) AS first_run_number
FROM (SELECT DISTINCT job_id
FROM job_runs) AS job_ids
So now we have the first interesting run for each job.
Finally, join these values back to the original table:
SELECT job_runs.*
FROM job_runs
JOIN (SELECT job_id,
IFNULL((SELECT run_number
FROM job_runs
WHERE job_id = job_ids.job_id
AND status != 'running'
ORDER BY run_number DESC
LIMIT 1 OFFSET 9
), -1) AS first_run_number
FROM (SELECT DISTINCT job_id
FROM job_runs) AS job_ids
) AS firsts
ON job_runs.job_id = firsts.job_id
AND job_runs.run_number >= firsts.first_run_number;
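As an aside (my addition, not part of the original answer): on SQLite 3.25 or newer, window functions allow a more direct formulation of "last 10 non-running runs per job":
-- Number each job's non-running rows from the newest run_number down,
-- then keep the first ten per job. Requires SQLite 3.25+.
SELECT run_id, job_id, run_number, status
FROM (SELECT job_runs.*,
             ROW_NUMBER() OVER (PARTITION BY job_id
                                ORDER BY run_number DESC) AS rn
      FROM job_runs
      WHERE status != 'running')
WHERE rn <= 10;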
I have the update query below to set some values and control the data flow, but I am getting the error "too many values" from the condition (subquery) when I execute it:
UPDATE MTB ----- TABLE NAME
SET MTB_EXTR_FLAG='N',
MTB_ALOC_PROCESS='DC1'
WHERE MTB_I IN --- PRIMARY KEY
(
SELECT * FROM
(
SELECT MTB_I ,ROW_NUMBER() OVER (ORDER BY ROWID) AS RN
FROM MTB
)
WHERE RN BETWEEN 100 AND 500
)
My intention here is to select a different set of data for each run of the job.
I want to set MTB_EXTR_FLAG='N', MTB_ALOC_PROCESS='DC1' each time before running the job, with a different set of data.
Can someone please help me resolve the error or propose a different query?
Thank you.
I think this is just a matter of the number of columns not matching (two, MTB_I and RN, instead of one, MTB_I):
UPDATE MTB
SET MTB_EXTR_FLAG='N',
MTB_ALOC_PROCESS='DC1'
WHERE MTB_I IN --- PRIMARY KEY
(
SELECT MTB_I FROM -- Else RN will be taken !!
(
SELECT MTB_I ,ROW_NUMBER() OVER (ORDER BY ROWID) AS RN
FROM MTB
)
WHERE RN BETWEEN 100 AND 500
)
You can't do WHERE x IN (...) with a subquery returning more columns than expected.
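As a side note (my addition; standard Oracle behavior): the column count simply has to match on both sides of IN, so a multi-column tuple form is also legal. Both patterns sketched below use only columns from the question:
-- Single-column IN: the subquery must return exactly one column.
SELECT * FROM MTB
WHERE MTB_I IN (SELECT MTB_I FROM MTB WHERE MTB_EXTR_FLAG = 'N');
-- Tuple IN: legal when the column count matches on both sides.
SELECT * FROM MTB
WHERE (MTB_I, MTB_ALOC_PROCESS) IN
      (SELECT MTB_I, MTB_ALOC_PROCESS FROM MTB WHERE MTB_EXTR_FLAG = 'N');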