Hi, I'm hosted on Heroku, running PostgreSQL 9.1.6 on their Ika plan (7.5 GB RAM). I have a table called cars. I need to do the following:
SELECT COUNT(*) FROM "cars" WHERE "cars"."reference_id" = 'toyota_hilux'
Now this takes an awful lot of time (64 sec!!!)
Aggregate (cost=2849.52..2849.52 rows=1 width=0) (actual time=63388.390..63388.391 rows=1 loops=1)
-> Bitmap Heap Scan on cars (cost=24.76..2848.78 rows=1464 width=0) (actual time=1169.581..63387.361 rows=739 loops=1)
Recheck Cond: ((reference_id)::text = 'toyota_hilux'::text)
-> Bitmap Index Scan on index_cars_on_reference_id (cost=0.00..24.69 rows=1464 width=0) (actual time=547.530..547.530 rows=832 loops=1)
Index Cond: ((reference_id)::text = 'toyota_hilux'::text)
Total runtime: 64112.412 ms
A little background:
The table holds around 3.2m rows, and the column that I'm trying to count on, has the following setup:
reference_id character varying(50);
and index:
CREATE INDEX index_cars_on_reference_id
ON cars
USING btree
(reference_id COLLATE pg_catalog."default" );
What am I doing wrong? I assume this is not the performance I should expect, or is it?
What @Satya claims in his comment is not quite true. In the presence of a matching index, the planner only chooses a full table scan if table statistics imply the query would return more than roughly 5 % of the table (the exact threshold depends on several factors), because it is then faster to scan the whole table.
As you see from your own question this is not the case for your query. It uses a Bitmap Index Scan followed by a Bitmap Heap Scan. Though I would have expected a plain index scan. (?)
I notice two more things in your explain output:
The first scan finds 832 rows, while the second reduces the count to 739. This would indicate that you have many dead tuples in your index.
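One way to check the dead-tuple suspicion is to look at the statistics collector's counters for the table. This is a minimal sketch, assuming the statistics collector is enabled (the default); the counters are estimates, not exact numbers.

```sql
-- Estimated live vs. dead rows and the last (auto)vacuum runs for the table.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'cars';
```

A high n_dead_tup relative to n_live_tup, together with old or empty vacuum timestamps, would support the dead-tuple explanation.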
Check the execution time after each step with EXPLAIN ANALYZE and maybe add the results to your question:
First, rerun the query with EXPLAIN ANALYZE two or three times to populate the cache. What's the result of the last run compared to the first?
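To see whether the difference between runs comes from caching, the BUFFERS option of EXPLAIN (available since PostgreSQL 9.0) can help. A sketch using the query from the question:

```sql
-- "shared hit" counts blocks found in shared buffers,
-- "read" counts blocks that had to be fetched from outside shared buffers.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   cars
WHERE  reference_id = 'toyota_hilux';
```

If the first run shows mostly "read" and later runs mostly "hit", the 64 seconds were largely spent waiting for disk I/O.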
Next:
VACUUM ANALYZE cars;
Rerun.
If you have lots of write operations on the table, I would set a fill factor lower than 100. Like:
ALTER TABLE cars SET (fillfactor=90);
Lower if your row size is big or you have a lot of write operations. Then:
VACUUM FULL ANALYZE cars;
This will take a while. Rerun.
Or, if you can afford to do this (and other important queries do not have contradicting requirements):
CLUSTER cars USING index_cars_on_reference_id;
This rewrites the table in the physical order of the index, which should make this kind of query much faster.
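Whether CLUSTER is likely to help can be estimated from the planner statistics. A sketch, assuming the table lives in the public schema and has been analyzed recently:

```sql
-- correlation close to 1 or -1: rows are stored almost in index order already;
-- close to 0: matching rows are scattered across many pages.
SELECT attname, correlation
FROM   pg_stats
WHERE  schemaname = 'public'      -- assumption: default schema
AND    tablename  = 'cars'
AND    attname    = 'reference_id';
```

Note that CLUSTER is a one-time operation: later writes are not kept in index order, so the correlation degrades again over time.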
Normalize schema
If you need this to be really fast, create a table car_type with a serial primary key and reference it from the table cars. This will shrink the necessary index to a fraction of what it is now.
Goes without saying that you make a backup before you try any of this.
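CREATE TABLE car_type ( -- permanent, not temp: the FK added below cannot reference a temp table
car_type_id serial PRIMARY KEY
, car_type text
);
INSERT INTO car_type (car_type)
SELECT DISTINCT car_type_id FROM cars ORDER BY car_type_id; -- car_type_id is still the old varchar column here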
ANALYZE car_type;
CREATE UNIQUE INDEX car_type_uni_idx ON car_type (car_type); -- unique types
ALTER TABLE cars RENAME COLUMN car_type_id TO car_type; -- rename old col
ALTER TABLE cars ADD COLUMN car_type_id int; -- add new int col
UPDATE cars c
SET car_type_id = ct.car_type_id
FROM car_type ct
WHERE ct.car_type = c.car_type;
ALTER TABLE cars DROP COLUMN car_type; -- drop old varchar col
CREATE INDEX cars_car_type_id_idx ON cars (car_type_id);
ALTER TABLE cars
ADD CONSTRAINT cars_car_type_id_fkey FOREIGN KEY (car_type_id )
REFERENCES car_type (car_type_id) ON UPDATE CASCADE; -- add fk
VACUUM FULL ANALYZE cars;
Or, if you want to go all-out:
CLUSTER cars USING cars_car_type_id_idx;
Your query would now look like this:
SELECT count(*)
FROM cars
WHERE car_type_id = (SELECT car_type_id FROM car_type
WHERE car_type = 'toyota_hilux')
And should be even faster. Mainly because index and table are smaller now, but also because integer handling is faster than varchar handling. The gain will not be dramatic over the clustered table on the varchar column, though.
A welcome side effect: if you have to rename a type, it's a tiny UPDATE to one row now, not messing with the big table at all.
I have the following table definition and want to improve indexes:
CREATE TABLE MATE (
GUID NUMBER(38,0),
SITE_KEY NUMBER(38,0),
LAST_NAME VARCHAR2(200),
FIRST_NAME VARCHAR2(200),
BOOKING_NUM VARCHAR2(200),
RELEASE_DATE DATE,
STATUS VARCHAR2(200), -- Contains 'ACTIVE', 'RELEASED', 'DELETED', 'EXCLUDED', 'INACTIVE' and NULL
CONSTRAINT SYS_C008630 CHECK ("GUID" IS NOT NULL),
CONSTRAINT SYS_C008631 CHECK ("SITE_KEY" IS NOT NULL),
CONSTRAINT SYS_C008632 PRIMARY KEY (GUID, SITE_KEY),
CONSTRAINT FK8100EDAADECFC243 FOREIGN KEY (SITE_KEY) REFERENCES SITES<KEY>()
);
CREATE UNIQUE INDEX SYS_C008632 ON MATE (GUID, SITE_KEY); -- This is the PK (1)
CREATE INDEX IDX_STATUS ON MATE (STATUS); -- (2)
CREATE INDEX IDX_SITE_KEY ON MATE (SITE_KEY); -- (3)
CREATE INDEX IDX_BOOKING_NUMBER ON MATE (BOOKING_NUM); -- (4)
CREATE INDEX IDX_FNAME ON MATE (FIRST_NAME); -- (5)
CREATE INDEX IDX_LNAME ON MATE (LAST_NAME); -- (6)
CREATE INDEX BRIAN2_IX ON MATE (SITE_KEY,BOOKING_NUM); -- (7)
CREATE INDEX IDX_SITE_STATUS ON MATE (SITE_KEY,STATUS); -- (8)
CREATE INDEX IDX_PIN_SITEKEY ON MATE (BOOKING_NUM,SITE_KEY); -- (9)
CREATE INDEX IDX_SITE_NAME_STATUS ON MATE (SITE_KEY,LAST_NAME,STATUS); -- (10)
CREATE UNIQUE INDEX IDX_GUID_SITE_BOOKING ON MATE (GUID, SITE_KEY, BOOKING_NUM); -- (11)
CREATE UNIQUE INDEX IXU_SITE_BOOKING_GUID ON MATE (SITE_KEY, BOOKING_NUM, RELEASE_DATE, GUID); -- (12)
Is it logical to:
Drop index (7) because it is already covered by (9)?
Drop index (3) because it is the leftmost column in (8) and (10)?
Drop index (4) because it is the leftmost column in (9)?
Drop index (12) because SITE_KEY, BOOKING_NUM, GUID is already a UNIQUE index in (11)?
Any other improvement?
You can't optimize indexes only by looking at their definition. You need to know how the indexes are used before you remove them.
Your Indexes Are Not Necessarily Redundant
For items #1 and #3, there are rare cases where you want to have two indexes that only differ based on the column order. For example, with the below two queries, it helps to have an index with both columns so you can avoid reading from the table. And the two different leading columns work better for each query. Having only one index is usually good enough, but maybe these are critical queries that need to be thoroughly optimized.
SELECT A, B FROM TABLE1 WHERE A = 1;
SELECT A, B FROM TABLE1 WHERE B = 2;
For items #2 and #4, the single-column indexes may be optimized for filtering, whereas the multi-column indexes may be optimized for index fast full scans (where the index acts like a skinny version of the table). For example, with the below queries, the first one runs best with an index on only column A, because that index is smaller and will be faster to read and more likely to fit in your cache. But the second query works best if there is an index on (A,B,C). Having the single, larger index is usually good enough, but not always.
SELECT * FROM TABLE1 WHERE A = 1;
SELECT A, B, C FROM TABLE1;
Which Indexes Are Necessary?
To find out which indexes are necessary, you should use index usage tracking. Fully optimizing indexes is a long, difficult process. But if you've gathered a list of suspicious indexes, and they are not used by any SQL statements, then they're probably safe to drop.
--Check that index statistics are collected.
select * from gv$index_usage_info;
--Check which indexes are used.
select * from dba_index_usage order by last_used desc;
--Find recent SQL statements that used the index.
select * from gv$sql_plan where object_owner = 'JHELLER' and object_name = 'TEST1_IDX';
--Find historical SQL statements that used the index.
select * from dba_hist_sql_plan where object_owner = 'JHELLER' and object_name = 'TEST1_IDX';
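Before actually dropping a suspicious index, it can be made invisible first. This is a sketch, assuming Oracle 11g or later and using IDX_SITE_KEY (index (3) from the question) purely as an example:

```sql
-- Hide the index from the optimizer; it is still maintained by DML.
ALTER INDEX IDX_SITE_KEY INVISIBLE;

-- ... run the normal workload for a while and watch for regressions ...

-- If something got slower, bring it back immediately:
ALTER INDEX IDX_SITE_KEY VISIBLE;

-- If nothing regressed, drop it for good:
-- DROP INDEX IDX_SITE_KEY;
```

This keeps the test reversible: making an index visible again is much cheaper than rebuilding a dropped index on a large table.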
I am running the following query on a remote Postgres instance, from a local client:
select * from matches_tb1 order by match_id desc limit 10;
matches_tb1 is a foreign table and has a unique index on match_id. The query seems to hang forever. When I use explain verbose, there is no ORDER BY attached to "Remote SQL". I guess the local server did not push down the ORDER BY to the remote server. How can I resolve this?
Here are the explain results:
explain verbose select match_id from matches_tb1 order by match_id desc limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------
Limit (cost=33972852.96..33972852.98 rows=10 width=8)
Output: match_id
-> Sort (cost=33972852.96..35261659.79 rows=515522734 width=8)
Output: match_id
Sort Key: matches_tb1.match_id DESC
-> Foreign Scan on public.matches_tb1 (cost=100.00..22832592.02 rows=515522734 width=8)
Output: match_id
Remote SQL: SELECT match_id FROM public.matches_tb1
(8 rows)
For the first query in your question:
select * from matches_tb1 order by match_id desc limit 10;
It appears based on the EXPLAIN plan that Postgres is not using the match_id B-tree index. This results in a very long query, because the database has to scan the entire table of roughly 500 million records and sort it to find the 10 records. As to why Postgres cannot use the index, the problem is select *. When the database reaches the leaf node of an entry in the index, it only finds a value for match_id. However, since you are doing select *, the database would have to do a lookup into the table itself (the heap; Postgres has no clustered index) to find the values for all the other columns. If your table has low correlation, then the optimizer would likely choose to abandon the index altogether and just do a full scan of the table.
In contrast, consider one of your other queries which is executing quickly:
select match_id from matches_tb1 where match_id > 4164287140
order by match_id desc limit 10
In this case, the index on match_id can be used, because you are only selecting match_id. In addition, the restriction in the where clause helps even more to make the index more specific.
So the resolution to your problem here is to not do select * with limit, if you want the query to finish quickly. For example, if you only wanted say two columns col1 and col2 from your table, then you may add those columns to the index to cover them. Then, the following query should also be fast:
select match_id, col1, col2 from matches_tb1 order by match_id desc limit 10;
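A covering index for that query could be sketched as follows. Assumptions: col1 and col2 are the placeholder column names from above, the index name is made up, and the statement is run on the remote server that holds the real table, since an index cannot be created on the local foreign table.

```sql
-- Run on the remote server, against the real table behind matches_tb1.
-- Index name is hypothetical; col1 and col2 are placeholders.
CREATE INDEX matches_tb1_match_id_col1_col2_idx
    ON matches_tb1 (match_id, col1, col2);
```

Index-only scans require PostgreSQL 9.2 or later on the server that executes the scan, and they only avoid heap visits for pages marked all-visible, so the table should be vacuumed regularly.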
I have a table with columns 'A' and 'B'.
'A' is a column with 90% NULL and 10% different values, and most of the time I query for records with one or two of these different values.
'B' is a column where 90% of the values are '1' and 10% are different values, and most of the time I query for records with one or two of these different values.
There are DML transactions on this table most of the time.
Now, I don't know whether defining an index on these columns is a good idea. If yes, which type of index?
In principle, a Bitmap Index would be the best in such a situation. However, in a multi-user environment they are not suitable: you would slow down your application significantly through table locks and perhaps even get deadlocks.
Maybe you can optimize your application by smart partitioning and usage of Partial Indexes (new feature in Oracle 12c).
The two CREATE TABLE statements below should be equivalent.
CREATE TABLE YOUR_TABLE (a INTEGER, b INTEGER, ... more COLUMNS)
PARTITION BY LIST (a) SUBPARTITION BY LIST (b) (
PARTITION part_a_NULL VALUES (NULL) (
SUBPARTITION part_a_NULL_b_1 VALUES (1) INDEXING OFF,
SUBPARTITION part_a_NULL_b_other VALUES (DEFAULT) INDEXING ON
),
PARTITION part_a_others VALUES (DEFAULT) (
SUBPARTITION part_a_others_b_1 VALUES (1) INDEXING OFF,
SUBPARTITION part_a_others_b_other VALUES (DEFAULT) INDEXING ON
)
);
CREATE TABLE YOUR_TABLE (a INTEGER, b INTEGER, ... more COLUMNS)
PARTITION BY LIST (a) SUBPARTITION BY LIST (b)
SUBPARTITION TEMPLATE (
SUBPARTITION b_1 VALUES (1) INDEXING OFF,
SUBPARTITION b_other VALUES (DEFAULT) INDEXING ON
)
(
PARTITION part_a_NULL VALUES (NULL),
PARTITION part_a_others VALUES (DEFAULT)
);
CREATE INDEX IND_A ON YOUR_TABLE (A) LOCAL INDEXING PARTIAL;
CREATE INDEX IND_B ON YOUR_TABLE (B) LOCAL INDEXING PARTIAL;
By this your index will consume only 10% of entire tablespace. If your WHERE condition is WHERE A IS NULL or WHERE B = 1 then Oracle optimizer would skip such indexes anyway.
Verify with this query
SELECT table_name, partition_name, subpartition_name, indexing
FROM USER_TAB_SUBPARTITIONS
WHERE table_name = 'YOUR_TABLE';
if INDEXING is used on desired subpartitions.
Update
I just realized this is actually overkill, because NULL values in column A do not create any index entry anyway. So, it can be simplified to
CREATE TABLE YOUR_TABLE (a INTEGER, b INTEGER, ... more COLUMNS)
PARTITION BY LIST (b) (
PARTITION part_b_1 VALUES (1) INDEXING OFF,
PARTITION part_b_other VALUES (DEFAULT) INDEXING ON
);
For example, if you have index a_b_idx on A, B (in that order):
a) select ... from ... where A = ... will use index
b) select ... from ... where B = ... will not use index
On the other side, if you have index b_a_idx on B, A:
a) select ... from ... where A = ... will not use index
b) select ... from ... where B = ... will use index
Oracle can't use the second column of an index if the query doesn't filter on the first column, since in regular cases an index is a tree-like structure: column1->column2->column3->etc.
You need an index on column A only, or on columns A, B, if you run queries like a).
You need an index on column B only, or on columns B, A, if you run queries like b).
Oracle doesn't store all-null entries in an index, but it can store a null value for A if B contains a non-null value.
Sometimes it's more efficient to read the whole table into memory and ignore the index. The optimizer can do this if the expected result set is big and covers a large share of the records, since the index-to-record lookup costs more than simply reading the records.
This also sometimes happens erroneously for tables without statistics, so you either need jobs running analyze table ... compute statistics, or Oracle 11+, which can gather statistics like this without jobs.
Most of the time, another index is a good thing for queries, but a bad thing for updates and disk usage. Each index takes disk space, and each update of a record has to update the indexes as well. So for heavily updated tables it's not good to have many indexes, but for frequently queried tables it's better to have indexes covering all common cases.
For most flat queries (without joins/subqueries/hierarchy) only one index is used, so having an index for each column is generally just a waste of disk space. You need a multicolumn index to optimize where A=... and B=...
As for index type, you probably need simple non-unique indexes.
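The column-order rule described above can be sketched like this. The table name your_table and the literal values are assumptions for illustration; a_b_idx is the index name used earlier in this answer.

```sql
-- Plain non-unique multicolumn index with A as the leading column.
CREATE INDEX a_b_idx ON your_table (A, B);

-- Can use a_b_idx: both queries filter on the leading column A.
SELECT * FROM your_table WHERE A = 10 AND B = 20;
SELECT * FROM your_table WHERE A = 10;

-- Generally cannot use a_b_idx efficiently: no filter on the leading column.
SELECT * FROM your_table WHERE B = 20;
```

Checking the actual execution plan for each query is the only reliable way to confirm which index the optimizer picks for your data.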
Column A
Let's assume that you create an index named _columnA_index_. Not every RDBMS includes NULL values in its indexes; where it does not, there are no index entries in _columnA_index_ pointing to records having NULL values. Thus, the following query
Q1: select * from MyTable where A is null;
will result in a table scan instead (or the DBMS opts to use another index on another column, if any).
However, since 10% of the records have 'different values', the _columnA_index_ will of course help for queries such as:
Q2: select * from MyTable where A = '123';
In the above example, if the query returns < 1% of the records, the _columnA_index_ is helpful. Depending on how selective the query is, the index can greatly improve performance. You can create an index type that is suitable for the datatype of column A.
Column B
Similarly, an index on B will not help
Q3: select * from MyTable where B = 1;
but it will help with different values
Q4: select * from MyTable where B = '456';
NULL values
So far, I have said that an index does not help with NULL values. Therefore, if you need to run Q1 most of the time, I suggest the following ideas:
Make sure that your version of the DBMS supports including NULL values in indexes. For example, Oracle 11g does, but not versions before that.
Plan to create a function-based index here, again with Oracle. But you can at least take the idea.
Redesign the logic of your application so it does not need to query on NULL values. I prefer this approach.
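The function-based index idea could look roughly like this in Oracle. It is only a sketch: the index name is made up, and the query must repeat the indexed expression exactly for the optimizer to consider the index.

```sql
-- Rows where A IS NULL get the key 1; all other rows evaluate to NULL
-- and therefore get no entry in this index.
CREATE INDEX mytable_a_isnull_idx
    ON MyTable (CASE WHEN A IS NULL THEN 1 END);

-- Q1 rewritten to use the same expression as the index.
SELECT *
FROM   MyTable
WHERE  CASE WHEN A IS NULL THEN 1 END = 1;
```

With 90% of the rows having NULL in A, the optimizer may still prefer a full table scan for this query, because an index rarely pays off when most of the table is returned.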
I want to optimize a query that uses MINUS and takes too much time. Any help would be appreciated.
I have two tables A and B,
Table A: ID, value
Table B: ID
I want all Table A records that are not in Table B, showing the value.
I wrote something like:
Select ID, value
FROM A
WHERE value> 70
MINUS
Select ID
FROM B;
But this query is taking too long. Any tips on how to improve this simple query?
Thank you for your attention.
Are ID and Value indexed?
The performance of MINUS and NOT EXISTS depends on several factors:
It really depends on a bunch of factors.
A MINUS will do a full table scan on both tables unless there is some
criteria in the where clause of both queries that allows an index
range scan. A MINUS also requires that both queries have the same
number of columns, and that each column has the same data type as the
corresponding column in the other query (or one convertible to the
same type). A MINUS will return all rows from the first query where
there is not an exact match column for column with the second query. A
MINUS also requires an implicit sort of both queries
NOT EXISTS will read the sub-query once for each row in the outer
query. If the correlation field (you are running a correlated
sub-query?) is an indexed field, then only an index scan is done.
The choice of which construct to use depends on the type of data you
want to return, and also the relative sizes of the two tables/queries.
If the outer table is small relative to the inner one, and the inner
table is indexed (preferrable a unique index but not required) on the
correlation field, then NOT EXISTS will probably be faster since the
index lookup will be pretty fast, and only executed a relatively few
times. If both tables a roughly the same size, then MINUS might be
faster, particularly if you can live with only seeing fields that you
are comparing on.
Minus operator versus 'not exists' for faster SQL query - Oracle Community Forums
You could use NOT EXISTS like so:
SELECT a.ID, a.Value
From a
where a.value > 70
and not exists(
Select b.ID
From B
Where b.ID = a.ID)
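The same anti-join can also be written with an outer join, which some people find easier to read. A sketch using the table and column names from the question:

```sql
-- Keep only rows of A that found no partner in B.
SELECT a.ID, a.value
FROM   A a
LEFT   JOIN B b ON b.ID = a.ID
WHERE  a.value > 70
AND    b.ID IS NULL;
```

Both forms return the value column, which the MINUS version cannot do, because MINUS requires both sides to select the same number of columns. Which form is faster depends on the optimizer and the data, so comparing the execution plans is worthwhile.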
EDIT: I've produced some dummy data and two datasets for testing to prove the performance increases of indexing. Note: I did this in MySQL since I don't have Oracle on my Macbook.
Table A has 2600 records with 2 columns: ID, val.
ID is an autoincrement integer
Val varchar(255)
Table b has one column, but more records than Table A. Autoincrement (in gaps of 3)
You can reproduce this if you wish: Pastebin - SQL Dummy Data
Here is the query I will be using:
select a.id, a.val from tablea a
where length(a.val) > 3
and not exists(
select b.id from tableb b where b.id = a.id
);
Without Indexes, the runtime is 986ms with 1685 rows.
Now we add the indexes:
ALTER TABLE `tablea` ADD INDEX `id` (`id`);
ALTER TABLE `tableb` ADD INDEX `id` (`id`);
With Indexes, the runtime is 14ms with 1685 rows. That's 1.42% of the time it took without indexes!
I have a table (sales_points) with ~23M rows. It has a b-tree index on (store_id, book_id). I would expect the following query to use that index, but EXPLAIN indicates that it is doing a sequential scan:
select distinct store_id, book_id from sales_points
Here is the output from EXPLAIN:
Unique (cost=2050448.88..2086120.31 rows=861604 width=8)
-> Sort (cost=2050448.88..2062339.35 rows=23780957 width=8)
Sort Key: store_id, book_id
-> Seq Scan on sales_points (cost=0.00..1003261.87 rows=23780957 width=8)
If I do this, it does use the index:
select distinct book_id from sales_points where store_id = 1
Here is the EXPLAIN output from this query:
HashAggregate (cost=999671.02..999672.78 rows=587 width=4)
-> Bitmap Heap Scan on sales_points (cost=55576.17..998149.04 rows=3043963 width=4)
Recheck Cond: (store_id = 1)
-> Bitmap Index Scan on index_sales_points_on_store_id_and_book_id (cost=0.00..55423.97 rows=3043963 width=0)
Index Cond: (store_id = 1)
Here is the table DDL:
CREATE TABLE sales_points
(
id serial NOT NULL,
book_id integer,
store_id integer,
date date,
created_at timestamp without time zone,
updated_at timestamp without time zone,
avg_list_price numeric(5,2),
royalty_amt numeric(9,2),
currency character varying(255),
settlement_date date,
paid_sales integer,
paid_returns integer,
free_sales integer,
free_returns integer,
lent_units integer,
lending_revenue numeric(9,2),
is_placeholder boolean,
distributor_id integer,
source1_id integer,
source2_id integer,
source3_id integer,
CONSTRAINT sales_points_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
Here is the index expression:
CREATE INDEX index_sales_points_on_store_id_and_book_id
ON sales_points
USING btree
(store_id, book_id);
So why wouldn't Postgres use the index to speed up the SELECT?
Well, I think your index is working OK when it is needed. Your first query has no WHERE clause, so Postgres will have to retrieve all records in the table anyway.
Just for testing, you can force the use of the index by disabling sequential scans:
SET enable_seqscan = OFF;
Postgres chooses its scan plan depending on various conditions. Taken from: http://www.postgresql.org/docs/9.2/static/indexes-examine.html
...When indexes are not used, it can be useful for testing to force their use. There are run-time parameters that can turn off various plan types. For instance, turning off sequential scans (enable_seqscan) and nested-loop joins (enable_nestloop), which are the most basic plans, will force the system to use a different plan. If the system still chooses a sequential scan or nested-loop join then there is probably a more fundamental reason why the index is not being used...
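To keep such an experiment from leaking into the rest of the session, the setting can be limited to a single transaction. A sketch using the query from the question:

```sql
BEGIN;
-- SET LOCAL only lasts until the end of this transaction.
SET LOCAL enable_seqscan = off;

EXPLAIN
SELECT DISTINCT store_id, book_id
FROM   sales_points;

ROLLBACK;  -- enable_seqscan is back to its previous value
```

If the plan with the index shows a higher estimated cost than the sequential scan, the planner's original choice was consistent with its cost model, and the index is simply not expected to help for a query that reads the whole table.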