Consider a table of job run history with the following schema:
job_runs
(
run_id integer not null, -- identifier of the run
job_id integer not null, -- identifier of the job
run_number integer not null, -- job run number, run numbers increment for each job
status text not null, -- status of the run (running, completed, killed, ...)
primary key (run_id)
-- ...
)
and I need to get the last 10 runs with status != 'running' for each job (jobs are distinguished by job_id). To do that I wrote the following query:
SELECT *
FROM job_runs AS JR1
WHERE JR1.run_number IN
    (SELECT JR2.run_number
     FROM job_runs AS JR2
     WHERE JR2.job_id = JR1.job_id
       AND JR2.status != 'running'
     ORDER BY JR2.run_number DESC
     LIMIT 10)
It does what I need, but even though there is a multi-column index on the job_id and run_number fields of the job_runs table, the query is slow: it scans the job_runs table and runs the subquery for each row. The index makes each subquery execution fast, but the fact that the outer query scans the entire table kills performance. So how can I tune the performance of this query?
Some thoughts: the number of jobs (distinct job_ids) is small, so if there were a FOR loop in SQLite it would be easy to loop over all distinct job_ids and run the subquery, passing each job ID instead of JR1.job_id, then UNION all the results.
Important: please don't suggest running the loop inside my application's source code. I need a pure SQL solution.
You could increase the performance of the subquery further by creating a covering index for it:
CREATE INDEX xxx ON job_runs(job_id, run_number, status);
But the biggest performance problem is that the subquery is executed for each row, although you need to run it only for each unique job ID.
So, first, get just the unique job IDs:
SELECT DISTINCT job_id
FROM job_runs
Then, for each of these IDs, determine the tenth largest run number:
SELECT job_id,
(SELECT run_number
FROM job_runs
WHERE job_id = job_ids.job_id
AND status != 'running'
ORDER BY run_number DESC
LIMIT 1 OFFSET 9
) AS first_run_number
FROM (SELECT DISTINCT job_id
FROM job_runs) AS job_ids
But if there are fewer than ten run numbers for a job, the subquery returns NULL, so let's replace that with a small number so that the comparison below (run_number >= first_run_number) works:
SELECT job_id,
IFNULL((SELECT run_number
FROM job_runs
WHERE job_id = job_ids.job_id
AND status != 'running'
ORDER BY run_number DESC
LIMIT 1 OFFSET 9
), -1) AS first_run_number
FROM (SELECT DISTINCT job_id
FROM job_runs) AS job_ids
So now we have the first interesting run for each job.
Finally, join these values back to the original table:
SELECT job_runs.*
FROM job_runs
JOIN (SELECT job_id,
IFNULL((SELECT run_number
FROM job_runs
WHERE job_id = job_ids.job_id
AND status != 'running'
ORDER BY run_number DESC
LIMIT 1 OFFSET 9
), -1) AS first_run_number
FROM (SELECT DISTINCT job_id
FROM job_runs) AS job_ids
) AS firsts
ON job_runs.job_id = firsts.job_id
AND job_runs.run_number >= firsts.first_run_number;
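If your SQLite is 3.25 or newer, a window function can express the same "last 10 per job" logic in a single pass; a minimal sketch against the same schema (the covering index above still helps):
-- Rank the runs of each job from newest to oldest, then keep the top 10:
SELECT run_id, job_id, run_number, status
FROM (SELECT run_id, job_id, run_number, status,
             ROW_NUMBER() OVER (PARTITION BY job_id
                                ORDER BY run_number DESC) AS rn
      FROM job_runs
      WHERE status != 'running')
WHERE rn <= 10;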
I am trying to list the top 3 records from a table based on an amount stored in a column FTE_TMUSD, which is of varchar datatype.
Below is the query I tried:
SELECT * FROM
(
SELECT * FROM FSE_TM_ENTRY
ORDER BY FTE_TMUSD desc
)
WHERE rownum <= 3
ORDER BY FTE_TMUSD DESC ;
Output I got:
972, 9680, 963 --> FTE_TMUSD values, which are not in descending order
I am expecting output that displays the top 3 records by value.
That should work: the inline view is ordered by FTE_TMUSD in descending order, and you're selecting values from it.
What looks suspicious are the values you specified as the result. It appears that FTE_TMUSD's datatype is VARCHAR2 (ah, yes - it is, you said so). That means the values are sorted as strings, not numbers - and it seems that you expect numbers. So, apply TO_NUMBER to that column. Note that it'll fail if the column contains anything but numbers (for example, if there's a value 972C).
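For instance, a minimal fix of the original query (assuming every FTE_TMUSD value really is numeric):
SELECT *
FROM (SELECT *
      FROM fse_tm_entry
      ORDER BY TO_NUMBER(fte_tmusd) DESC
     )
WHERE rownum <= 3;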
Also, an alternative to your query might be the use of analytic functions, such as ROW_NUMBER:
with temp as
(select f.*,
row_number() over (order by to_number(f.fte_tmusd) desc) rn
from fse_tm_entry f
)
select *
from temp
where rn <= 3;
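On Oracle 12c or later you could also skip the inline view and use the row-limiting clause (same caveat about non-numeric values):
SELECT *
FROM fse_tm_entry
ORDER BY TO_NUMBER(fte_tmusd) DESC
FETCH FIRST 3 ROWS ONLY;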
SQLite DB with a single table and 60,000,000 records; running a simple query takes more than 100 seconds.
I tried switching to PostgreSQL, but its performance was even worse.
I haven't tested it on MySQL or MS SQL.
Should I split the table? Say, a different table for each pointID (there are some hundreds of them), or a different table for each month (then I'd have a maximum of 10,000,000 records per table)?
SQL schema:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
The following query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
The next query took 150 seconds (fewer conditions):
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Edit:
An answer from another place suggested adding the following index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
This improved performance by a factor of 3.
Edit #2: per @Raymond Nijland's suggestion, the execution plan is:
SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)
EXECUTE LIST SUBQUERY 1
USE TEMP B-TREE FOR ORDER BY
and thanks to him - using this information, I changed the order of the operations in the query as follows:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This made a big improvement (for me it's solved).
After @RaymondNijland suggested checking the execution plan, I changed the query to:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This query gives the same results as the other one, but is 120 times faster (it decreases the number of records before sorting).
I am trying to get the maximum value of a count, selecting two labels (srcip and the max), but every time I include srcip I have to add GROUP BY srcip at the end, which gives me a result as if the max wasn't even there.
When I write the query like this, it gives me the correct max value, but I want to select srcip as well:
Select max(count1) as maximum
from (SELECT srcip,count(srcip) as count1 from data group by srcip)t;
But when I do include srcip in the SELECT, I get a result as if there were no max function:
Select srcip,max(count1) as maximum
from (SELECT srcip,count(srcip) as count1 from data group by srcip)t
group by srcip;
I would expect a single result from this, but I get multiple.
Does anyone have any ideas?
You may do ORDER BY count DESC with LIMIT 1 to get the srcip with the MAX count.
SELECT srcip, count(srcip) as count1
from data group by srcip
ORDER BY count1 DESC LIMIT 1
Let's consider you have data like this:
[sample data - shown as a screenshot in the original post]
Let's see what happens to the data when you run the following query:
SELECT srcip,count(srcip) as count1 from data group by srcip
[output: table1 - shown as a screenshot in the original post]
Now let's see what happens when you run your outer query on the table above:
Select srcip,max(count1) as maximum from table1 group by srcip
The output is the same. The reason is that your query asks for srcip and the maximum of count from each group of srcip - and there are 3 groups, so you get 3 rows.
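To make this concrete with hypothetical values (the original post showed its data only in screenshots): suppose data holds srcip 10.0.0.1 three times, 10.0.0.2 twice, and 10.0.0.3 once. The inner query then yields:
srcip    | count1
---------+-------
10.0.0.1 | 3
10.0.0.2 | 2
10.0.0.3 | 1
Grouping this again by srcip puts each row in its own one-row group, so max(count1) per group is just that row's count1, and all 3 rows come back unchanged.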
The query below returns exactly one row: the max count and its associated srcip. Some could argue there are better ways to optimize this for your expected result, but it should give you a motivation to look more into Hive analytic queries.
select srcip, count1 as maximum
from (select srcip,
             count1,
             row_number() over (order by count1 desc) as row_num
      from (select srcip, count(srcip) as count1
            from data
            group by srcip) t) q1
where row_num = 1;
MERGE INTO ////////1 GFO
USING
(SELECT *
FROM
(SELECT facto/////rid,
p-Id,
PRE/////EDATE,
RU//MODE,
cre///date,
ROW_NUMBER() OVER (PARTITION BY facto/////id ORDER BY cre///te DESC) col
FROM ///////////2
) x
WHERE x.col = 1) UFD
ON (GFO.FACTO-/////RID=UFD.FACTO////RID)
WHEN MATCHED THEN UPDATE
SET
GFO.PRE////DATE=UFD.PRE//////DATE
WHERE UFD.CRE/////DATE IS NOT NULL
AND UFD.RU//MODE= 'S'
AND GFO.P////ID=:2
Hi everyone, my merge statement above is taking too long. It has to run 40 times, on table1 using table2, each having 4 million plus records, for 40 different p--id values. Please suggest a more efficient way, as currently it takes 40+ minutes.
It's updating only one column, using a column from table2.
I am unable to execute the query; it returns:
Error: cannot fetch last explain plan from PLAN_TABLE
[screenshot of the EXPLAIN PLAN, showing the cost]
The plan shown seems to be OK; the observed problem stems from the loop over P_ID, which does not scale.
I assume you perform something like this (strongly simplified), where the P_IDs to be processed are in table TAB_PID:
begin
for cur in (select p_id from tab_pid) loop
merge INTO tab1 USING tab2 ON (tab1.r_id = tab2.r_id)
WHEN MATCHED THEN
UPDATE SET tab1.col1=tab2.col1 WHERE p_id = cur.p_id;
end loop;
end;
/
A HASH JOIN on large tables (in NO PARALLEL mode) with an elapsed time of 60 seconds is not a catastrophic result, but looping 40 times makes up your 40 minutes.
So I'd suggest trying to integrate the loop into the MERGE statement. Without knowing the details, something like this (maybe you'll also need to adjust the MERGE join condition):
merge INTO tab1 USING tab2 ON (tab1.r_id = tab2.r_id)
WHEN MATCHED THEN
UPDATE SET tab1.col1=tab2.col1
WHERE p_id in (select p_id from tab_pid);
I want to group rows based on certain columns: if the data in these columns is the same in consecutive rows, assign them the same group number; if it changes, assign a new one. This becomes complex because the same data in those columns can appear later in other rows, and those must get another group number since they are not consecutive with the previous group.
I used a recursive CTE for this, and it gives the correct output, but it is so slow that iterating over 75k+ rows takes about 15 minutes. The code I used is:
WITH
cte AS (SELECT ROW_NUMBER () OVER (ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd) AS RowNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen, SpecialismeCode, Specialismen
FROM t_opnames)
SELECT * INTO #ttt FROM cte;
WITH cte2 AS (SELECT TOP 1 RowNumber,
1 AS GroupNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen, SpecialismeCode, Specialismen
FROM #ttt
ORDER BY RowNumber
UNION ALL
SELECT c1.RowNumber,
CASE
WHEN c2.Afdelingscode <> c1.Afdelingscode
OR c2.Patient_ID <> c1.Patient_ID
OR c2.Opnametype <> c1.Opnametype
THEN c2.GroupNumber + 1
ELSE c2.GroupNumber
END AS GroupNumber,
c1.Opnamenummer,c1.Patient_ID,c1.AfdelingsCode,c1.Opnamedatum,c1.Opnamedatumtijd,c1.Ontslagdatum,c1.Ontslagdatumtijd,c1.IsSpoedopname,c1.OpnameType,c1.IsNuOpgenomen, SpecialismeCode, Specialismen
FROM cte2 c2
JOIN #ttt c1 ON c1.RowNumber = c2.RowNumber + 1
)
SELECT *
FROM cte2
OPTION (MAXRECURSION 0) ;
DROP TABLE #ttt
I tried to improve performance by putting the output of the first CTE into a temp table. That helped, but it's still too slow. So, how can I get this code to run in under 10 seconds for 75k+ records? A screenshot of the output before cancelling the query showed that the data in the columns Afdelingscode, Patient_ID and Opnametype is the same in RowNumbers 3, 5 and 6, yet they have different GroupNumbers because the rows are not consecutive.
Without data it's not that easy to test, but I would first try not using the temporary table and just chaining both CTEs from start to end, i.e.:
;WITH
cte AS (...),
cte2 AS (...)
select * from cte2
OPTION (MAXRECURSION 0);
Without knowing your indices, etc.: for instance, you do a lot of ordering in the first CTE. Is this supported by indices (or one multi-column index) or not?
Without the data I don't have the option to play with it, but looking at this:
CASE
WHEN c2.Afdelingscode <> c1.Afdelingscode
OR c2.Patient_ID <> c1.Patient_ID
OR c2.Opnametype <> c1.Opnametype
THEN c2.GroupNumber + 1
ELSE c2.GroupNumber
I would take a look at the PARTITION BY clause of ROW_NUMBER.
So try to run this:
WITH
cte AS (
SELECT ROW_NUMBER () OVER (PARTITION BY Afdelingscode , Patient_ID ,Opnametype ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd ) AS RowNumber,
Opnamenummer, Patient_ID, AfdelingsCode, Opnamedatum, Opnamedatumtijd, Ontslagdatum, Ontslagdatumtijd, IsSpoedopname, OpnameType, IsNuOpgenomen
FROM t_opnames)
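If the partitioned ROW_NUMBER alone doesn't produce the consecutive-group numbering you want, a sketch of the classic gaps-and-islands pattern might (an assumption on my part, requiring SQL Server 2012+ for LAG, and not tested against your data):
WITH flagged AS (
    -- Flag a row as starting a new group when any grouping column
    -- differs from the previous row's value (LAG is NULL on the first
    -- row, so the comparison fails and the row is flagged 1):
    SELECT *,
           CASE WHEN LAG(Patient_ID)    OVER (ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd) = Patient_ID
                 AND LAG(AfdelingsCode) OVER (ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd) = AfdelingsCode
                 AND LAG(Opnametype)    OVER (ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd) = Opnametype
                THEN 0 ELSE 1 END AS IsNewGroup
    FROM t_opnames
)
-- A running sum of the flags numbers the groups in one pass, with no recursion:
SELECT *,
       SUM(IsNewGroup) OVER (ORDER BY Patient_ID, Opnamenummer, SPECIALISMEN, Opnametype, OntslagDatumTijd
                             ROWS UNBOUNDED PRECEDING) AS GroupNumber
FROM flagged;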