Perfomance SQL Where Clause with column - performance

I have a couple of tables that look whether a "normal suscription" (A) is valid from a certain date. and a "trial Subscription" (B) that has a cancellation date.
TABLE A
ValidFrom id
2022-09-24 1
2022-01-25 2
TABLE B
id cancellationDate
1 2023-07-16
2 2023-07-16
1 2023-06-05
2 2019-07-04
1 2016-10-01
1 2023-12-16
1 2017-10-28
by trying to figure out whether there is a subscription after the trial period I created the following query.
SELECT
Sales.B.CancellationDate
,Y.ValidFrom
FROM Sales.A AS Y
INNER JOIN
Sales.B ON Y.id = Sales.B.id
WHERE
Sales.B.CancellationDate > Y.ValidFrom
ORDER BY 1
which will return the following:
CancellationDate ValidFrom
2023-06-05 2022-09-24
2023-07-16 2022-09-24
2023-07-16 2022-01-25
2023-12-16 2022-09-24
I was wondering if there is a better way to improve the performance of the following: ( Consider that indexes are there )
Sales.B.CancellationDate > Y.ValidFrom
I have the feeling that the DB Engine is doing a N x M Comparisons in order to get the result and do the BOOLEAN true.
EDIT:
Indexes/Table:
CLUSTERED INDEX [ClusteredIndex-20181107-194645] ON [Sales].[B]( [id] ASC)
CLUSTERED INDEX [ClusteredIndex-20181107-194645] ON [Sales].[A]( [id] ASC)
CREATE TABLE [Sales].[A](
[ValidFrom] [date] NULL,
[id] [tinyint] NULL
) ON [PRIMARY]
CREATE TABLE [Sales].[B](
[id] [tinyint] NULL,
[cancellationDate] [date] NULL
) ON [PRIMARY]

Related

Power Pivot - Looking up details with start and end dates?

I need help in the formula correctness of my approach in this PowerPivot table in Excel.
I have two tables: daily timesheet and employee details. The idea is to lookup the department, sub-department, and managers from the employee_details table based on the employee number and shift date in the daily_timesheet table.
Here's a representation of the tables:
daily_timesheet
shift_date
emp_num
scheduled_hours
actual_worked_hrs
dept
sub_dept
mgr
2022-02-28
01234
7.5
7.34
16100
16181
05432
2022-03-15
01234
7.5
7.50
16200
16231
06543
employee_details
emp_num
dept_code
sub_dept_code
mgr_emp_num
start_date
end_date
is_current
01234
16000
16041
04321
2022-01-01
2022-01-31
FALSE
01234
16100
16181
05432
2022-02-01
2022-01-28
FALSE
01234
16200
16231
06543
2022-03-01
null
TRUE
End dates have null values if it is the employee's current assignment; but it's never null if the is_current field is FALSE.
Here's what I've managed so far:
First, lookup the current start date in the employee details is less than or equal to the shift date.
If true, I'm using the LOOKUP function to return the department, sub-department, and manager by searching the employee number and the true value in the is_current field.
Else, I use the MIN function to get the value of those fields and wrap it around a CALCULATE function then apply FILTER for: (1) emp_num matching the timesheet, (2) is_current that has a FALSE value, (3) start_date less than or equal to the shift_date, and (4) end_date is greater than or equal to the shift_date.
And the bedrock of my question is actually, the item 3 above. I know using the MIN function is incorrect, but I can't find any solution that will work.
Here's the formula I've been using for to get the dept in the daily_timesheet table from the employee_details table:
=IF(
LOOKUP(employee_details[start_date],
employee_details[emp_num],
daily_timesheet[emp_num],
employee_details[is_current] = TRUE) <= daily_timesheet[shift_date],
LOOKUP(employee_details[dept_code],
employee_details[emp_num],
daily_timesheet[emp_num],
employee_details[is_current] = TRUE),
CALCULATE(MIN(employee_details[dept_code]),
FILTER(employee_details, employee_details[emp_num] = daily_timesheet[emp_num]),
FILTER(employee_details, employee_details[is_current] = FALSE),
FILTER(employee_details, employee_details[start_date] <= daily_timesheet[shift_date]),
FILTER(employee_details, employee_details[end_date] >= daily_timesheet[shift_date]))
)
Any advice please?

How can i improce performance of sqlite with large table?

SQlite DB with single table and 60,000,000 records. time to run simple query is more then 100 seconds.
I've tried to switch to postgeSQL but its performance was even less good.
Hadn't test it on mySQL or msSQL.
Shell I split the table (lets say different table for each pointID - there are some hundreds of it? or different table for each month - then I'll have maximum of 10,000,000 records?)
sql scheme:
CREATE TABLE `collectedData` (
`id` INTEGER,
`timeStamp` double,
`timeDateStr` nvarchar,
`pointID` nvarchar,
`pointIDindex` double,
`trendNumber` integer,
`status` nvarchar,
`value` double,
PRIMARY KEY(`id`)
);
CREATE INDEX `idx_pointID` ON `collectedData` (
`pointID`
);
CREATE INDEX `idx_pointIDindex` ON `collectedData` (
`pointIDindex`
);
CREATE INDEX `idx_timeStamp` ON `collectedData` (
`timeStamp`
);
CREATE INDEX `idx_trendNumber` ON `collectedData` (
`trendNumber`
);
Next query took 107 seconds:
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
next query took 150 seconds (less conditions)
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointIDindex % 1 = 0
order by timestamp desc, id desc limit 5000
Editing:
Asnwer from another place - add the next index:
CREATE INDEX idx_All ON collectedData (trendNumber, pointid, pointIDindex, status, timestamp desc, id desc, timeDateStr, value)
had improved performance by factor of 3.
Editing #2: by #Raymond Nijland offer: the execution plan is:
SEARCH TABLE collectedData USING COVERING INDEX idx_All (trendNumber=? AND pointID=?)"
"0" "0" "0" "EXECUTE LIST SUBQUERY 1"
"0" "0" "0" "USE TEMP B-TREE FOR ORDER BY"
and thanks to him - using this data, I've changed the order of the rules in the query to the next:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
this made big improvement (for me it's solved).
After #RaymondNijland had offered me to check the execution plan, I've changed the query to:
select * from (
select * from collectedData
where
trendNumber =1
and status <> '' and
timestamp <=1556793244
and pointid in ('point1','point2','pont3','point4','point5','point6','point7','point8','point9','pointa')
and pointIDindex % 1 = 0
order by id desc limit 5000
) order by timestamp desc
This query gives same results like the other, but is't 120 times faster (decrease the number of records before sorting).

Powerpivot: Retrieve max value for a group in a related table

I have 2 tables with a one-to-many relationship.
-TableGroup: table with groupletter
-TableAll: table with unique identifier, groupletter, a date
Problem: I want to get the max value of the date from TableAll into a new column in TableGroup. See below.
Question: What is the formula for column MAXdate?
TableAll:
ID | Group | date
1 A 4/01/2017
2 A 2/10/2016
3 A 2/06/2016
4 B 2/12/2016
5 B 15/12/2016
6 B 2/03/2017
7 C 5/02/2016
8 C 16/01/2016
TableGroup:
Group | MAXdate
A 4/01/2017
B 2/03/2017
C 5/02/2016
The below formula doesn't work:
See here
The answer is:
CALCULATE (
MAX ( TableAll[Date] ),
FILTER ( TableAll, TableAll[Group] = EARLIER ( TableGroup[Group] ) )
)
Try:
CALCULATE (
MAX ( TableAll[Date] ),
FILTER ( TableGroup, TableGroup[Group] = EARLIER ( TableGroup[Group] ) )
)
How it works:
EARLIER ( TableGroup[Group] ) expression essentially means "current row". Filter function goes row by row over TableGroup table, filters it by current row's group, and then finds max date for that group.

Add next unique value to SQL column

I have two tables which I am trying to join based on two criteria. One of the criteria is that a date from t1 is between a date in t2 and the next date in t2. The other is that the name from t1 matches the name from t2.
I.e. if t2 looks like this:
Record Name Date
1 A1234 2016-01-03 04:58:00
2 A1234 2015-12-15 08:34:00
3 A5678 2016-01-04 03:14:00
4 A1234 2016-01-05 21:06:00
Then:
Any records from t1 for Name A1234 with dates between 2016-01-03 04:58:00 and 2016-01-05 21:06:00 would be joined to record 1.
Any records from t1 for Name A1234 with dates between 2015-12-15 08:34:00 and 2016-01-03 04:58:00 would be joined to record 2
Any records from t1 for A1234 after the date of record 4 would be joined to record 4
Any records from t1 for A5678 would be joined to record 3 because there's only one date.
My initial approach is to use a correlated subquery to find the next date. However, due to a large number of records, I determined this would take over a year to execute because it searches all of t2 for the next later date during each iteration. Original SQLite:
CREATE TABLE outputtable AS SELECT * FROM t1, t2 d
WHERE t1.Name = d.Name AND t1.Date BETWEEN d.Date AND (
SELECT * FROM (
SELECT Date from t2
WHERE t2.Name = d.Name
ORDER BY Date ASC )
WHERE Date > d.Date
LIMIT 1 )
Now, I would like to find the next date only once for all records in t2 and create a new column in t2 that contains the next date. This way, I only search for the next date about 400,000 times instead of 56 billion times, significantly improving my performance.
Thus the output of the query I'm looking for would make t2 look like this:
Record Name Date Next_Date
1 A1234 2016-01-03 04:58:00 2016-01-05 21:06:00
2 A1234 2015-12-15 08:34:00 2016-01-03 04:58:00
3 A5678 2016-01-04 03:14:00 2999-12-31 23:59:59
4 A1234 2016-01-05 21:06:00 2999-12-31 23:59:59
Then I would be able to simply query whether t1.Date is between t2.Date and t2.Next_Date.
How can I build a query that will add the next date to a new column in t2?
Rather than add the new column, you should just be able to use a query like the one below to join the tables:
SELECT
T1.*,
T2_1.*
FROM
T1
INNER JOIN T2 T2_1 ON
T2_1.Name = T1.Name AND
T2_1.some_date < T1.some_date
LEFT OUTER JOIN T2 T2_2 ON
T2_2.Name = T1.Name AND
T2_2.some_date > T2_1.some_date
LEFT OUTER JOIN T2 T2_3 ON
T2_3.Name = T1.Name AND
T2_3.some_date > T2_1.some_date AND
T2_3.some_date < T2_2.some_date
WHERE
T2_3.Name IS NULL
You can do the same with NOT EXISTS, but this method often has better performance.
You can speed up (sub)queries by using proper indexes.
To check which indexes are actually used, use EXPLAIN QUERY PLAN.
Your original query, without any indexes, would be executed by SQLite 3.10.0 like this:
0|0|0|SCAN TABLE t1
0|1|1|SEARCH TABLE t2 AS d USING AUTOMATIC COVERING INDEX (name=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SCAN TABLE t2
1|0|0|USE TEMP B-TREE FOR ORDER BY
(The "automatic" index is created temporarily just for this query; the optimizer has estimated that this would still be faster than not using any index.)
In this case, you get the most optimal query plan by indexing all columns used for lookups:
create index i1nd on t1(name, date);
create index i2nd on t2(name, date);
0|0|1|SCAN TABLE t2 AS d
0|1|0|SEARCH TABLE t1 USING INDEX i1nd (name=? AND date>? AND date<?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE t2 USING COVERING INDEX i2nd (name=? AND date>?)
I've used this method on tables with around 1 mil rows with success. Obviously, creating an index that will cover this query will help performance.
This approach uses RANK to create a value to join against. After creating the RANK in a CTE (I use this for readability reasons, please correct for style or personal preference), use a sub-query to join rnk to rnk + 1; aka the next date.
Here's an example of what the code looks like using your sample values.
IF OBJECT_ID('tempdb..#T2') IS NOT NULL
DROP TABLE #T2
CREATE TABLE #T2
(
Record INT NOT NULL PRIMARY KEY,
Name VARCHAR(10),
[DATE] DATETIME,
)
INSERT INTO #T2
VALUES (1, 'A1234', '2016-01-03 04:58:00'),
(2, 'A1234', '2015-12-15 08:34:00'),
(3, 'A5678', '2016-01-04 03:14:00'),
(4, 'A1234', '2016-01-05 21:06:00');
WITH Rank_Dates
AS (Select *
,rank() OVER(PARTITION BY #t2.name ORDER BY #t2.date DESC) AS rnk
FROM #T2)
select RD1.Record,
RD1.Name,
RD1.DATE,
COALESCE (RD2.DATE, '2999-12-31 23:59:59') AS NEXT_DATE
FROM Rank_Dates RD1
LEFT JOIN Rank_Dates RD2
ON RD1.rnk = RD2.rnk + 1
AND RD1.Name = RD2.Name
ORDER BY RD1.Record -- ORDER BY is optional
;
EDIT: added code output below.
The code above produces the following output.
Record Name DATE NEXT_DATE
1 A1234 2016-01-03 04:58:00.000 2016-01-05 21:06:00.000
2 A1234 2015-12-15 08:34:00.000 2016-01-03 04:58:00.000
3 A5678 2016-01-04 03:14:00.000 2999-12-31 23:59:59.000
4 A1234 2016-01-05 21:06:00.000 2999-12-31 23:59:59.000
On a random note. Would using the CURRENT_TIMESTAMP in place of hard coding '2999-12-31 23:59:59.000' produce a similar result?

Postgres Hash Join speed [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 9 years ago.
I have 3 tables which I wish to join using inner joins in Postgres 9.1, reads, devices, and device_patients. Below is an abbreviated schema for each table.
reads -- ~250,000 rows
CREATE TABLE reads
(
id serial NOT NULL,
device_id integer NOT NULL,
value bigint NOT NULL,
read_datetime timestamp without time zone NOT NULL,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT reads_pkey PRIMARY KEY (id )
)
WITH (
OIDS=FALSE
);
ALTER TABLE reads
OWNER TO postgres;
CREATE INDEX index_reads_on_device_id
ON reads
USING btree
(device_id );
CREATE INDEX index_reads_on_read_datetime
ON reads
USING btree
(read_datetime );
devices -- ~500 rows
CREATE TABLE devices
(
id serial NOT NULL,
serial_number character varying(20) NOT NULL,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT devices_pkey PRIMARY KEY (id )
)
WITH (
OIDS=FALSE
);
ALTER TABLE devices
OWNER TO postgres;
CREATE UNIQUE INDEX index_devices_on_serial_number
ON devices
USING btree
(serial_number COLLATE pg_catalog."default" );
patient_devices -- ~25,000 rows
CREATE TABLE patient_devices
(
id serial NOT NULL,
patient_id integer NOT NULL,
device_id integer NOT NULL,
issuance_datetime timestamp without time zone NOT NULL,
unassignment_datetime timestamp without time zone,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT patient_devices_pkey PRIMARY KEY (id )
)
WITH (
OIDS=FALSE
);
ALTER TABLE patient_devices
OWNER TO postgres;
CREATE INDEX index_patient_devices_on_device_id
ON patient_devices
USING btree
(device_id );
CREATE INDEX index_patient_devices_on_issuance_datetime
ON patient_devices
USING btree
(issuance_datetime );
CREATE INDEX index_patient_devices_on_patient_id
ON patient_devices
USING btree
(patient_id );
CREATE INDEX index_patient_devices_on_unassignment_datetime
ON patient_devices
USING btree
(unassignment_datetime );
patients -- ~1,000 rows
CREATE TABLE patients
(
id serial NOT NULL,
first_name character varying(50) NOT NULL,
middle_name character varying(50),
last_name character varying(50) NOT NULL,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
CONSTRAINT participants_pkey PRIMARY KEY (id )
)
WITH (
OIDS=FALSE
);
ALTER TABLE patients
OWNER TO postgres;
Here is my abbreviated query.
SELECT device_patients.patient_id, serial_number FROM reads
INNER JOIN devices ON devices.id = reads.device_id
INNER JOIN patient_devices ON device_patients.device_id = devices.id
WHERE (reads.read_datetime BETWEEN '2012-01-01 10:30:01.000000' AND '2013-05-18 03:03:42')
AND (read_datetime > issuance_datetime) AND ((unassignment_datetime IS NOT NULL AND read_datetime < unassignment_datetime) OR
(unassignment_datetime IS NULL))
GROUP BY serial_number, patient_devices.patient_id LIMIT 10
Ultimately this will be a small part of a larger query (without the LIMIT, I only added the limit to prove to myself that the long runtime was not due to returning a bunch of rows), however I've done a bunch of experimenting and determined that this is the slow part of the larger query. When I run EXPLAIN ANALYZE on this query I get the following output (also viewable here)
Limit (cost=156442.31..156442.41 rows=10 width=13) (actual time=2815.435..2815.441 rows=10 loops=1)
-> HashAggregate (cost=156442.31..159114.89 rows=267258 width=13) (actual time=2815.432..2815.437 rows=10 loops=1)
-> Hash Join (cost=1157.78..151455.79 rows=997304 width=13) (actual time=30.930..2739.164 rows=250150 loops=1)
Hash Cond: (devices.device_id = devices.id)
Join Filter: ((reads.read_datetime > patient_devices.issuance_datetime) AND (((patient_devices.unassignment_datetime IS NOT NULL) AND (reads.read_datetime < patient_devices.unassignment_datetime)) OR (patient_devices.unassignment_datetime IS NULL)))
-> Seq Scan on reads (cost=0.00..7236.94 rows=255396 width=12) (actual time=0.035..64.433 rows=255450 loops=1)
Filter: ((read_datetime >= '2012-01-01 10:30:01'::timestamp without time zone) AND (read_datetime <= '2013-05-18 03:03:42'::timestamp without time zone))
-> Hash (cost=900.78..900.78 rows=20560 width=37) (actual time=30.830..30.830 rows=25015 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 1755kB
-> Hash Join (cost=19.90..900.78 rows=20560 width=37) (actual time=0.776..20.551 rows=25015 loops=1)
Hash Cond: (patient_devices.device_id = devices.id)
-> Seq Scan on patient_devices (cost=0.00..581.93 rows=24893 width=24) (actual time=0.014..7.867 rows=25545 loops=1)
Filter: ((unassignment_datetime IS NOT NULL) OR (unassignment_datetime IS NULL))
-> Hash (cost=13.61..13.61 rows=503 width=13) (actual time=0.737..0.737 rows=503 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 24kB
-> Seq Scan on devices (cost=0.00..13.61 rows=503 width=13) (actual time=0.016..0.466 rows=503 loops=1)
Filter: (entity_id = 2)
Total runtime: 2820.392 ms
My question is how do I speed this up? Right now I'm running this on my Windows machine for testing, but ultimately it will be deployed on Ubuntu, will that make a difference? Any insight into why this takes 2 seconds would be greatly appreciated.
Thanks
It has been suggested that the LIMIT might be altering the query plan. Here is the same query without the LIMIT. The slow part still appears to be the Hash Join.
Also, here are the relevant tuning parameters. Again I'm only testing this on Windows now, and I don't know what effect this would have on a Linux machine
shared_buffers = 2GB
effective_cache_size = 4GB
work_mem = 256MB
random_page_cost = 2.0
Here are the statistics for the reads table
Statistic Value
Sequential Scans 130
Sequential Tuples Read 28865850
Index Scans 283630
Index Tuples Fetched 141421907
Tuples Inserted 255450
Tuples Updated 0
Tuples Deleted 0
Tuples HOT Updated 0
Live Tuples 255450
Dead Tuples 0
Heap Blocks Read 20441
Heap Blocks Hit 3493033
Index Blocks Read 8824
Index Blocks Hit 4840210
Toast Blocks Read
Toast Blocks Hit
Toast Index Blocks Read
Toast Index Blocks Hit
Last Vacuum 2013-05-20 09:23:03.782-07
Last Autovacuum
Last Analyze 2013-05-20 09:23:03.91-07
Last Autoanalyze 2013-05-17 19:01:44.075-07
Vacuum counter 1
Autovacuum counter 0
Analyze counter 1
Autoanalyze counter 6
Table Size 27 MB
Toast Table Size none
Indexes Size 34 MB
Here are the statistics for the devices table
Statistic Value
Sequential Scans 119
Sequential Tuples Read 63336
Index Scans 1053935
Index Tuples Fetched 1053693
Tuples Inserted 609
Tuples Updated 0
Tuples Deleted 0
Tuples HOT Updated 0
Live Tuples 609
Dead Tuples 0
Heap Blocks Read 32
Heap Blocks Hit 1054553
Index Blocks Read 32
Index Blocks Hit 2114305
Toast Blocks Read
Toast Blocks Hit
Toast Index Blocks Read
Toast Index Blocks Hit
Last Vacuum
Last Autovacuum
Last Analyze
Last Autoanalyze 2013-05-17 19:02:49.692-07
Vacuum counter 0
Autovacuum counter 0
Analyze counter 0
Autoanalyze counter 2
Table Size 48 kB
Toast Table Size none
Indexes Size 128 kB
Here are the statistics for the patient_devices table
Statistic Value
Sequential Scans 137
Sequential Tuples Read 3065400
Index Scans 853990
Index Tuples Fetched 46143763
Tuples Inserted 25545
Tuples Updated 24936
Tuples Deleted 0
Tuples HOT Updated 0
Live Tuples 25547
Dead Tuples 929
Heap Blocks Read 1959
Heap Blocks Hit 6099617
Index Blocks Read 1077
Index Blocks Hit 2462681
Toast Blocks Read
Toast Blocks Hit
Toast Index Blocks Read
Toast Index Blocks Hit
Last Vacuum
Last Autovacuum 2013-05-17 19:01:44.576-07
Last Analyze
Last Autoanalyze 2013-05-17 19:01:44.697-07
Vacuum counter 0
Autovacuum counter 6
Analyze counter 0
Autoanalyze counter 6
Table Size 2624 kB
Toast Table Size none
Indexes Size 5312 kB
Below is the full query that I'm trying to speed up. The smaller query is indeed faster, but I was unable to make my full query faster which is reproduced below. As suggested, I added 4 new indices, UNIQUE(device_id, issuance_datetime), UNIQUE(device_id, issuance_datetime), UNIQUE(patient_id, unassignment_datetime), UNIQUE(patient_id, unassignment_datetime)
SELECT
first_name
, last_name
, MAX(max_read) AS read_datetime
, SUM(value) AS value
, serial_number
FROM (
SELECT
pa.first_name
, pa.last_name
, value
, first_value(de.serial_number) OVER(PARTITION BY pa.id ORDER BY re.read_datetime DESC) AS serial_number -- I'm not sure if this is a good way to do this, but I don't know of another way
, re.read_datetime
, MAX(re.read_datetime) OVER (PARTITION BY pd.id) AS max_read
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
AND re.read_datetime >= pd.issuance_datetime
AND re.read_datetime < COALESCE(pd.unassignment_datetime , 'infinity'::timestamp)
INNER JOIN patients pa ON pa.id = pd.patient_id
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
) AS foo WHERE read_datetime = max_read
GROUP BY first_name, last_name, serial_number ORDER BY value desc
LIMIT 10
Sorry for not posting this earlier, but I thought this query would be too complicated, and was trying to simply the problem, but apparently I still can't figure it out. It seems like it would be a LOT quicker if I could limit the results returned by the nested select using the max_read variable, but according to numerous sources, that isn't allowed in Postgres.
FYI: sanitised query:
SELECT pd.patient_id
, de.serial_number
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
AND re.read_datetime >= pd.issuance_datetime -- changed this from '>' to '>='
AND (re.read_datetime < pd.unissuance_datetime OR pd.unissuance_datetime IS NULL)
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01.000000' AND '2013-05-18 03:03:42'
GROUP BY de.serial_number, pd.patient_id
LIMIT 10
;
UPDATE: without the original typos:
EXPLAIN ANALYZE
SELECT pd.patient_id
, de.serial_number
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
AND re.read_datetime >= pd.issuance_datetime
AND (re.read_datetime < pd.unassignment_datetime OR pd.unassignment_datetime IS NULL)
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01.000000' AND '2013-05-18 03:03:42'
GROUP BY de.serial_number, pd.patient_id
LIMIT 10
;
UPDATE: this is about 6 times as fast here (on synthetic data, and with a slightly altered data model)
-- Modified data model + synthetic data:
CREATE TABLE devices
( id serial NOT NULL
, serial_number character varying(20) NOT NULL
-- , created_at timestamp without time zone NOT NULL
-- , updated_at timestamp without time zone NOT NULL
, CONSTRAINT devices_pkey PRIMARY KEY (id )
, UNIQUE (serial_number)
) ;
CREATE TABLE reads
-- ( id serial NOT NULL PRIMARY KEY -- You don't need this surrogate key
( device_id integer NOT NULL REFERENCES devices (id)
, value bigint NOT NULL
, read_datetime timestamp without time zone NOT NULL
-- , created_at timestamp without time zone NOT NULL
-- , updated_at timestamp without time zone NOT NULL
, PRIMARY KEY ( device_id, read_datetime)
) ;
CREATE TABLE patient_devices
-- ( id serial NOT NULL PRIMARY KEY -- You don't need this surrogate key
( patient_id integer NOT NULL -- REFERENCES patients (id)
, device_id integer NOT NULL REFERENCES devices(id)
, issuance_datetime timestamp without time zone NOT NULL
, unassignment_datetime timestamp without time zone
-- , created_at timestamp without time zone NOT NULL
-- , updated_at timestamp without time zone NOT NULL
, PRIMARY KEY (device_id, issuance_datetime)
, UNIQUE (device_id, unassignment_datetime)
) ;
-- CREATE INDEX index_patient_devices_on_issuance_datetime ON patient_devices (device_id, unassignment_datetime );
-- may need some additional indices later
-- devices -- ~500 rows
INSERT INTO devices(serial_number) SELECT 'No_' || gs::text FROM generate_series(1,500) gs;
-- reads -- ~100K rows
INSERT INTO reads(device_id, read_datetime, value)
SELECT de.id, gs
, (random()*1000000)::bigint
FROM devices de
JOIN generate_series('2012-01-01', '2013-05-01' , '1 hour' ::interval) gs
ON random() < 0.02;
-- patient_devices -- ~25,000 rows
INSERT INTO patient_devices(device_id, issuance_datetime, patient_id)
SELECT DISTINCT ON (re.device_id, read_datetime)
re.device_id, read_datetime, pa
FROM generate_series(1,100) pa
JOIN reads re
ON random() < 0.01;
-- close the open intervals
UPDATE patient_devices dst
SET unassignment_datetime = src.issuance_datetime
FROM patient_devices src
WHERE src.device_id = dst.device_id
AND src.issuance_datetime > dst.issuance_datetime
AND NOT EXISTS ( SELECT *
FROM patient_devices nx
WHERE nx.device_id = src.device_id
AND nx.issuance_datetime > dst.issuance_datetime
AND nx.issuance_datetime < src.issuance_datetime
)
;
VACUUM ANALYZE patient_devices;
VACUUM ANALYZE devices;
VACUUM ANALYZE reads;
-- EXPLAIN ANALYZE
SELECT pd.patient_id
, de.serial_number
--, COUNT (*) AS zcount
FROM reads re
INNER JOIN devices de ON de.id = re.device_id
INNER JOIN patient_devices pd ON pd.device_id = de.id
AND re.read_datetime >= pd.issuance_datetime
AND re.read_datetime < COALESCE(pd.unassignment_datetime , 'infinity'::timestamp)
WHERE re.read_datetime BETWEEN '2012-01-01 10:30:01' AND '2013-05-18 03:03:42'
GROUP BY de.serial_number, pd.patient_id
LIMIT 10
;
look at the parts of the analyze report where you see Seq Scan.
for example this parts could use some indexes:
Seq Scan on patient_devices - > unassignment_datetime
Seq Scan on devices -> entity_id
Seq Scan on reads - > read_datetime
about read_datetime: it is possible to create a specific index for mathematical equation like > and <=, which will come handy. i dont know the syntax for it, though

Resources