Why doesn't my Postgres SQL Query use an index - performance

I have a table (sales_points) with ~23M rows. It has a b-tree index on (store_id, book_id). I would expect the following query to use that index, but EXPLAIN indicates that it is doing a sequential scan:
select distinct store_id, book_id from sales_points
Here is the output from EXPLAIN:
Unique  (cost=2050448.88..2086120.31 rows=861604 width=8)
  ->  Sort  (cost=2050448.88..2062339.35 rows=23780957 width=8)
        Sort Key: store_id, book_id
        ->  Seq Scan on sales_points  (cost=0.00..1003261.87 rows=23780957 width=8)
If I do this, it does use the index:
select distinct book_id from sales_points where store_id = 1
Here is the EXPLAIN output from this query:
HashAggregate  (cost=999671.02..999672.78 rows=587 width=4)
  ->  Bitmap Heap Scan on sales_points  (cost=55576.17..998149.04 rows=3043963 width=4)
        Recheck Cond: (store_id = 1)
        ->  Bitmap Index Scan on index_sales_points_on_store_id_and_book_id  (cost=0.00..55423.97 rows=3043963 width=0)
              Index Cond: (store_id = 1)
Here is the table DDL:
CREATE TABLE sales_points
(
id serial NOT NULL,
book_id integer,
store_id integer,
date date,
created_at timestamp without time zone,
updated_at timestamp without time zone,
avg_list_price numeric(5,2),
royalty_amt numeric(9,2),
currency character varying(255),
settlement_date date,
paid_sales integer,
paid_returns integer,
free_sales integer,
free_returns integer,
lent_units integer,
lending_revenue numeric(9,2),
is_placeholder boolean,
distributor_id integer,
source1_id integer,
source2_id integer,
source3_id integer,
CONSTRAINT sales_points_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
Here is the index definition:
CREATE INDEX index_sales_points_on_store_id_and_book_id
ON sales_points
USING btree
(store_id, book_id);
So why wouldn't Postgres use the index to speed up the SELECT?

Well, I think your index is working fine when it's needed. Your first query has no WHERE clause, so Postgres has to read all records in the table anyway.
Just for testing, you can force the use of the index by disabling sequential scans:
SET enable_seqscan = OFF;
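The setting only affects your current session, so a quick test sequence might look like this (compare the estimated costs of the two plans, then restore the default):
SET enable_seqscan = OFF;
EXPLAIN SELECT DISTINCT store_id, book_id FROM sales_points;
RESET enable_seqscan; -- restore the default planner behavior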
Postgres chooses its scan plan depending on various conditions. Taken from http://www.postgresql.org/docs/9.2/static/indexes-examine.html:
...When indexes are not used, it can be useful for testing to force their use. There are run-time parameters that can turn off various plan types. For instance, turning off sequential scans (enable_seqscan) and nested-loop joins (enable_nestloop), which are the most basic plans, will force the system to use a different plan. If the system still chooses a sequential scan or nested-loop join then there is probably a more fundamental reason why the index is not being used...
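Worth adding (my own note, not from the quoted page): on Postgres 9.2 and later, a query like this can in principle be answered by an index-only scan on (store_id, book_id), but only if the visibility map shows most pages as all-visible. A sketch of how you might test that:
VACUUM sales_points;  -- refreshes the visibility map
EXPLAIN SELECT DISTINCT store_id, book_id FROM sales_points;
-- the planner may now prefer an Index Only Scan on
-- index_sales_points_on_store_id_and_book_id over the Seq Scan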

Related

My remote Postgres query seems to hang forever

I am running the following query on a remote Postgres instance, from a local client:
select * from matches_tb1 order by match_id desc limit 10;
matches_tb1 is a foreign table with a unique index on match_id. The query seems to hang forever. When I use EXPLAIN VERBOSE, there is no ORDER BY attached to the "Remote SQL", so I guess the local server did not push the ORDER BY down to the remote server. How can I resolve this?
Attached are the EXPLAIN results:
explain verbose select match_id from matches_tb1 order by match_id desc limit 10;
                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
 Limit  (cost=33972852.96..33972852.98 rows=10 width=8)
   Output: match_id
   ->  Sort  (cost=33972852.96..35261659.79 rows=515522734 width=8)
         Output: match_id
         Sort Key: matches_tb1.match_id DESC
         ->  Foreign Scan on public.matches_tb1  (cost=100.00..22832592.02 rows=515522734 width=8)
               Output: match_id
               Remote SQL: SELECT match_id FROM public.matches_tb1
(8 rows)
For the first query in your question:
select * from matches_tb1 order by match_id desc limit 10;
It appears from the EXPLAIN plan that Postgres is not using the B-tree index on match_id. This results in a very long-running query, because the database has to scan the entire 500-million-row table and sort it just to find 10 records. As to why Postgres cannot use the index, the problem is select *. At the leaf level of the index, each entry only contains the value of match_id. Since you are doing select *, the database would then have to do a lookup into the table itself (the heap) to fetch the values of all the other columns for every entry. If your table has low correlation, the optimizer will likely abandon the index altogether and just do a full scan of the table.
In contrast, consider one of your other queries which is executing quickly:
select match_id from matches_tb1 where match_id > 4164287140
order by match_id desc limit 10
In this case, the index on match_id can be used because you are only selecting match_id. In addition, the restriction in the WHERE clause narrows the range of the index that has to be read.
So the resolution to your problem is to avoid select * with LIMIT if you want the query to finish quickly. For example, if you only want, say, two columns col1 and col2 from your table, you can add those columns to the index so that it covers them. Then the following query should also be fast:
select match_id, col1, col2 from matches_tb1 order by match_id desc limit 10;
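A minimal sketch of such a covering index, assuming the remote server runs Postgres 11 or later (col1 and col2 are the placeholder columns from above, and the index name is made up; on older versions, list the extra columns as trailing key columns instead of using INCLUDE):
CREATE INDEX matches_tb1_match_id_covering
ON matches_tb1 (match_id)
INCLUDE (col1, col2);  -- created on the remote server, where the real table lives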

Accelerate SQLite Query

I'm currently learning SQLite (called by Python).
Following up on my previous question (Reorganising Data in SQLite), I want to store multiple time series (training data) in my database.
I have defined the following fields:
CREATE TABLE VARLIST
(
VarID INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL
)
CREATE TABLE DATAPOINTS
(
DataID INTEGER PRIMARY KEY,
timeID INTEGER,
VarID INTEGER,
value REAL
)
CREATE TABLE TIMESTAMPS
(
timeID INTEGER PRIMARY KEY AUTOINCREMENT,
TRAININGS_ID INT,
TRAINING_TIME_SECONDS FLOAT
)
VARLIST has 8 entries, TIMESTAMPS 1e5 entries and DATAPOINTS around 5e6.
When I now want to extract data for a given TRAININGS_ID and VarID, I try it like this:
SELECT
(SELECT TIMESTAMPS.TRAINING_TIME_SECONDS
FROM TIMESTAMPS
WHERE t.timeID = timeID) AS TRAINING_TIME_SECONDS,
(SELECT value
FROM DATAPOINTS
WHERE DATAPOINTS.timeID = t.timeID and DATAPOINTS.VarID = 2) as value
FROM
(SELECT timeID
FROM TIMESTAMPS
WHERE TRAININGS_ID = 96) as t;
The command EXPLAIN QUERY PLAN delivers:
0|0|0|SCAN TABLE TIMESTAMPS
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SCAN TABLE DATAPOINTS
This basically works.
But there are two problems:
Minor problem: if there is a timeID with no data for the requested VarID, I get a line with the value None.
I would prefer this line to be skipped.
Big problem: the search is incredibly slow (approx 5 minutes using http://sqlitebrowser.org/).
How do I best improve the performance?
Are there better ways to formulate the SELECT command, or should I modify the database structure itself?
OK, based on the hints I got, I could accelerate the search enormously by applying indexes:
CREATE INDEX IF NOT EXISTS DP_Index on DATAPOINTS (VarID,timeID,DataID);
CREATE INDEX IF NOT EXISTS TS_Index on TIMESTAMPS(TRAININGS_ID,timeID);
The EXPLAIN QUERY PLAN output now reads as:
0|0|0|SEARCH TABLE TIMESTAMPS USING COVERING INDEX TS_Index (TRAININGS_ID=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE DATAPOINTS USING INDEX DP_Index (VarID=? AND timeID=?)
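For the minor problem, a sketch (my own addition, using the same indexes): rewriting the correlated subqueries as a plain inner join skips the timeIDs that have no row for the requested VarID, instead of returning None:
SELECT ts.TRAINING_TIME_SECONDS, dp.value
FROM TIMESTAMPS ts
JOIN DATAPOINTS dp
  ON dp.timeID = ts.timeID
 AND dp.VarID = 2          -- inner join: unmatched timeIDs are dropped
WHERE ts.TRAININGS_ID = 96;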
Thanks for your comments.

T-SQL - wrong query execution plan behaviour

One of our queries degraded after generating load on the DB.
Our query is a join between 3 tables:
Base table, which contains 10M rows.
EventPerson table, which contains 5000 rows.
EventPerson788, which is empty.
It seems that the optimizer scans the index on EventPerson instead of seeking it. This is a script for replicating the issue:
--Create Tables
CREATE TABLE [dbo].[BASE](
[ID] [bigint] NOT NULL,
[IsActive] BIT
PRIMARY KEY CLUSTERED ([ID] ASC)
)ON [PRIMARY]
GO
CREATE TABLE [dbo].[EventPerson](
[DUID] [bigint] NOT NULL,
[PersonInvolvedID] [bigint] NULL,
PRIMARY KEY CLUSTERED ([DUID] ASC)
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [EventPerson_IDX] ON [dbo].[EventPerson]
(
[PersonInvolvedID] ASC
)
CREATE TABLE [dbo].[EventPerson788](
[EntryID] [bigint] NOT NULL,
[LinkedSuspectID] [bigint] NULL,
[sourceid] [bigint] NULL,
PRIMARY KEY CLUSTERED ([EntryID] ASC)
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[EventPerson788] WITH CHECK
ADD CONSTRAINT [FK7A34153D3720F84A]
FOREIGN KEY([sourceid]) REFERENCES [dbo].[EventPerson] ([DUID])
GO
ALTER TABLE [dbo].[EventPerson788] CHECK CONSTRAINT [FK7A34153D3720F84A]
GO
CREATE NONCLUSTERED INDEX [EventPerson788_IDX]
ON [dbo].[EventPerson788] ([LinkedSuspectID] ASC)
GO
--POPULATE BASE TABLE
DECLARE @I BIGINT = 1
WHILE (@I < 10000000)
BEGIN
    BEGIN TRANSACTION
    INSERT INTO BASE(ID) VALUES(@I)
    SET @I += 1
    IF (@I % 10000 = 0)
    BEGIN
        COMMIT;
    END;
END
GO
--POPULATE EventPerson TABLE
DECLARE @I BIGINT = 1
WHILE (@I < 5000)
BEGIN
    BEGIN TRANSACTION
    INSERT INTO EventPerson(DUID, PersonInvolvedID)
    VALUES(@I, (SELECT TOP 1 ID FROM BASE ORDER BY NEWID()))
    SET @I += 1
    IF (@I % 10000 = 0)
        COMMIT TRANSACTION;
END
GO
This is the query:
select
count(EventPerson.DUID)
from
EventPerson
inner loop join
Base on EventPerson.DUID = base.ID
left outer join
EventPerson788 on EventPerson.DUID = EventPerson788.sourceid
where
(EventPerson.PersonInvolvedID = 37909 or
EventPerson788.LinkedSuspectID = 37909)
AND BASE.IsActive = 1
Do you have any idea why the optimizer decides to use index scan instead of index seek?
Workarounds already tried:
Analyzed the tables and rebuilt statistics.
Rebuilt the indexes.
Tried the FORCESEEK hint.
None of the above persuaded the optimizer to use an index seek on EventPerson or on the Base table.
Thanks for your help.
The scan is there because of the or condition and the outer join against EventPerson788.
It returns rows from EventPerson either when EventPerson.PersonInvolvedID = 37909, or when there exist rows in EventPerson788 where EventPerson788.LinkedSuspectID = 37909. The second condition means that every row in EventPerson has to be checked against the join.
The fact that EventPerson788 is empty can not be used by the query optimizer since the query plan is saved to be reused later when there might be matching rows in EventPerson788.
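As a side note of my own (not part of the original answer): if you can afford to pay compilation cost on every execution, OPTION (RECOMPILE) forces a fresh plan each time, which may let the optimizer take the current emptiness of EventPerson788 into account:
select count(EventPerson.DUID)
from EventPerson
inner loop join Base on EventPerson.DUID = base.ID
left outer join EventPerson788 on EventPerson.DUID = EventPerson788.sourceid
where (EventPerson.PersonInvolvedID = 37909
    or EventPerson788.LinkedSuspectID = 37909)
  and BASE.IsActive = 1
option (recompile)  -- plan is not cached, so current cardinalities are used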
Update:
You can rewrite your query using a UNION ALL instead of the OR to get a seek on EventPerson.
select count(EventPerson.DUID)
from
(
select EventPerson.DUID
from EventPerson
where EventPerson.PersonInvolvedID = 37909 and
not exists (select *
from EventPerson788
where EventPerson788.LinkedSuspectID = 37909)
union all
select EventPerson788.sourceid
from EventPerson788
where EventPerson788.LinkedSuspectID = 37909
) as EventPerson
inner join BASE
on EventPerson.DUID=base.ID
where
BASE.IsActive=1
Well, you're asking SQL Server to count the rows of the EventPerson table - so why do you expect a seek to be better than a scan here?
For a COUNT, the SQL Server optimizer will almost always use a scan - it needs to count the rows, after all - all of them... it will do a clustered index scan, if no other non-nullable columns are indexed.
If you have an index on a small, non-nullable column (e.g. an ID INT or something like that), it would probably do a scan on that index instead (less data to read to count all rows).
But in general: seek is great for selecting one or a few rows - but it sucks if you're dealing with all rows (like for a count)
You can easily observe this behavior if you're using the AdventureWorks sample database.
When doing a COUNT(*) on the Sales.SalesOrderDetail table, which has over 120,000 rows, like this:
SELECT COUNT(*) FROM Sales.SalesOrderDetail
then you'll get an index scan on IX_SalesOrderDetail_ProductID - it just doesn't pay off to do seeks on over 120000 entries!
However, if you do the same operation on a smaller set of data, like this:
SELECT COUNT(*) FROM Sales.SalesOrderDetail
WHERE ProductID = 897
then you get back 2 rows out of all of them - and SQL Server will now use an index seek on that same index.

Why is this count query so slow?

Hi, I'm hosted on Heroku, running PostgreSQL 9.1.6 on their Ika plan (7.5 GB RAM). I have a table called cars. I need to do the following:
SELECT COUNT(*) FROM "cars" WHERE "cars"."reference_id" = 'toyota_hilux'
Now this takes an awful lot of time (64 seconds!):
Aggregate  (cost=2849.52..2849.52 rows=1 width=0) (actual time=63388.390..63388.391 rows=1 loops=1)
  ->  Bitmap Heap Scan on cars  (cost=24.76..2848.78 rows=1464 width=0) (actual time=1169.581..63387.361 rows=739 loops=1)
        Recheck Cond: ((reference_id)::text = 'toyota_hilux'::text)
        ->  Bitmap Index Scan on index_cars_on_reference_id  (cost=0.00..24.69 rows=1464 width=0) (actual time=547.530..547.530 rows=832 loops=1)
              Index Cond: ((reference_id)::text = 'toyota_hilux'::text)
Total runtime: 64112.412 ms
A little background:
The table holds around 3.2M rows, and the column I'm trying to count on has the following definition:
reference_id character varying(50);
and index:
CREATE INDEX index_cars_on_reference_id
ON cars
USING btree
(reference_id COLLATE pg_catalog."default" );
What am I doing wrong? Surely this is not the performance I should expect - or is it?
What @Satya claims in his comment is not quite true. In the presence of a matching index, the planner only chooses a full table scan if table statistics imply the query would return more than around 5% of the table (it depends), because it is then faster to scan the whole table.
As you can see from your own question, this is not the case for your query: it uses a Bitmap Index Scan followed by a Bitmap Heap Scan (though I would have expected a plain index scan).
I notice two more things in your EXPLAIN output:
The first scan finds 832 rows, while the second reduces the count to 739. This indicates that you have many dead tuples in your index.
Check the execution time after each step with EXPLAIN ANALYZE and maybe add the results to your question:
First, rerun the query with EXPLAIN ANALYZE two or three times to populate the cache. What's the result of the last run compared to the first?
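For instance (my addition to the answer's advice: the BUFFERS option shows how much of the data comes from the cache as opposed to disk):
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM cars WHERE reference_id = 'toyota_hilux';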
Next:
VACUUM ANALYZE cars;
Rerun.
If you have lots of write operations on the table, I would set a fill factor lower than 100. Like:
ALTER TABLE cars SET (fillfactor=90);
Go lower if your rows are big or you have a lot of write operations. Then:
VACUUM FULL ANALYZE cars;
This will take a while. Rerun.
Or, if you can afford to do this (and other important queries do not have contradicting requirements):
CLUSTER cars USING index_cars_on_reference_id;
This rewrites the table in the physical order of the index, which should make this kind of query much faster. (Note that CLUSTER takes an exclusive lock on the table while it runs, and the physical order is not maintained for subsequent writes.)
Normalize schema
If you need this to be really fast, create a table car_type with a serial primary key and reference it from the table cars. This will shrink the necessary index to a fraction of what it is now.
It goes without saying that you should make a backup before you try any of this. (The script below assumes the varchar column in cars is named car_type_id; substitute your actual column name, reference_id. Note too that car_type must be a regular, permanent table: a foreign key on cars cannot reference a temp table.)
CREATE TABLE car_type (
car_type_id serial PRIMARY KEY
, car_type text
);
INSERT INTO car_type (car_type)
SELECT DISTINCT car_type_id FROM cars ORDER BY car_type_id;
ANALYZE car_type;
CREATE UNIQUE INDEX car_type_uni_idx ON car_type (car_type); -- unique types
ALTER TABLE cars RENAME COLUMN car_type_id TO car_type; -- rename old col
ALTER TABLE cars ADD COLUMN car_type_id int; -- add new int col
UPDATE cars c
SET car_type_id = ct.car_type_id
FROM car_type ct
WHERE ct.car_type = c.car_type;
ALTER TABLE cars DROP COLUMN car_type; -- drop old varchar col
CREATE INDEX cars_car_type_id_idx ON cars (car_type_id);
ALTER TABLE cars
ADD CONSTRAINT cars_car_type_id_fkey FOREIGN KEY (car_type_id)
REFERENCES car_type (car_type_id) ON UPDATE CASCADE; -- add fk
VACUUM FULL ANALYZE cars;
Or, if you want to go all-out:
CLUSTER cars USING cars_car_type_id_idx;
Your query would now look like this:
SELECT count(*)
FROM cars
WHERE car_type_id = (SELECT car_type_id FROM car_type
WHERE car_type = 'toyota_hilux')
And it should be even faster, mainly because the index and the table are smaller now, but also because integer handling is faster than varchar handling. The gain over the table clustered on the varchar column will not be dramatic, though.
A welcome side effect: if you have to rename a type, it's a tiny UPDATE to one row now, not messing with the big table at all.

Very slow update on a relatively small table in PostgreSQL

Well, I have the following table (info from pgAdmin):
CREATE TABLE comments_lemms
(
comment_id integer,
freq integer,
lemm_id integer,
bm25 real
)
WITH (
OIDS=FALSE
);
ALTER TABLE comments_lemms OWNER TO postgres;
-- Index: comments_lemms_comment_id_idx
-- DROP INDEX comments_lemms_comment_id_idx;
CREATE INDEX comments_lemms_comment_id_idx
ON comments_lemms
USING btree
(comment_id);
-- Index: comments_lemms_lemm_id_idx
-- DROP INDEX comments_lemms_lemm_id_idx;
CREATE INDEX comments_lemms_lemm_id_idx
ON comments_lemms
USING btree
(lemm_id);
And one more table:
CREATE TABLE comments
(
id serial NOT NULL,
nid integer,
userid integer,
timest timestamp without time zone,
lemm_length integer,
CONSTRAINT comments_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE comments OWNER TO postgres;
-- Index: comments_id_idx
-- DROP INDEX comments_id_idx;
CREATE INDEX comments_id_idx
ON comments
USING btree
(id);
-- Index: comments_nid_idx
-- DROP INDEX comments_nid_idx;
CREATE INDEX comments_nid_idx
ON comments
USING btree
(nid);
comments_lemms has 8 million entries; comments has 270 thousand.
I'm performing the following SQL query:
update comments_lemms set bm25=(select lemm_length from comments where id=comment_id limit 1)
It runs for more than 20 minutes, and I stop it because pgAdmin looks like it's about to crash.
Is there any way to modify this query, the indexes, or anything else in my database to speed things up a bit? I have to run some similar queries in the future, and it's quite painful to wait more than 30 minutes for each one.
comments_lemms has 8 million entries; comments has 270 thousand. I'm performing the following SQL query:
update comments_lemms set bm25=(select lemm_length from comments where id=comment_id limit 1)
In other words, you're making it go through 8M entries, and for each row it does a nested loop with an index lookup. PG won't rewrite/optimize it because of the limit 1 instruction.
Try this instead:
update comments_lemms set bm25 = comments.lemm_length
from comments
where comments.id = comments_lemms.comment_id;
It should do two seq scans and hash or merge join them together, then proceed with the update in one go.
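One behavioral difference to be aware of (my note, not from the original answer): the subquery version writes NULL into bm25 for comments_lemms rows that have no matching comment, while UPDATE ... FROM simply skips those rows. If you need the old behavior, a follow-up pass along these lines should restore it:
update comments_lemms set bm25 = NULL
where not exists (select 1 from comments where comments.id = comments_lemms.comment_id);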
